Preface


This volume contains a selection of papers presented to the Third
International Conference on the L1-Norm and Related Methods, held in Neuchâtel,
Switzerland, from August 11-15, 1997, as a Satellite Meeting to the 51st 
ISI Session in Istanbul. The conference included invited talks, contributed 
papers and a tutorial. A Summer School in Regression and Time Series 
Analysis for young graduate students and research workers ran in parallel. 

The success of the 1987 and 1992 conferences on the Statistical Data 
Analysis based on the L1-Norm and related methods made it evident that
there is a need for regular conferences on the topic. For this reason we 
launched the third and happily brought together many new faces, especially 
those of younger statisticians. 

This volume includes 38 invited papers listed under nine headings. 

The Prologue contains two papers: one on measuring the performance of
boundary-estimation methods by Peter Hall and Marc Raimondo, and another
by Roger Koenker on a new computational procedure for L1. These two papers
were the opening and closing lectures of the conference. Peter Hall and
Marc Raimondo consider the problem of linear approximation to a curved
boundary using gridded data, which is closely related to both curve
estimation in statistics and rational approximation in number theory. They
show that measures of performance based on the L1 norm are more
appropriate for the problem than those based on Lp norms for p > 1. Roger
Koenker’s breakthrough in the computation of L1 is rather different from
the simplex method, since it does not iterate around the exterior of the
constraint set. When there are many observations, the simplex algorithm
becomes too slow, since it has to pass through too many vertices to
achieve the optimal solution. His algorithm starts in the interior of the
constraint set and takes penalized Newton steps, with the log-barrier
formulation designed to keep the algorithm in the interior.

Part one contains seven papers on estimation, testing and characterization.
A new regression rank statistic for testing general hypotheses in a class
of nonparametric linear models is introduced by Cornelius Gutenbrunner.
In the same article, Gutenbrunner develops the asymptotic representation
of his regression rank statistics. Marc Hallin and Ivan Mizera study the
unimodality and asymptotics of M-estimates. Marie Huškova, after describing
the likelihood principle for normally distributed errors, provides the
reader with four L1 type tests for the change point problem: the
likelihood ratio, a Wald type, a score type and a Bayesian type test
statistic. She then studies the limit behavior of these test statistics
under the null hypothesis. Stephan Morgenthaler investigates the behaviour
of residuals from the L1 fit in linear models of designed experiments.
Stephan argues that the residuals obtained by L1 fitting in classification
models exhibit some weaknesses. Jana Jureckova, Klebanov and Silvelyn 20
(her signature) contribute two interesting papers on the inadmissibility
of robust estimators with respect to the L1-norm and on L1 estimation in
nonlinear regression and in nonlinear error-in-variables models. For
testing the hypothesis $H_0: L\beta = l$ versus $H_1: L\beta \neq l$ in
linear models, Christine Müller derives a Wald type test statistic based
on the L1-estimator with maximum relative power under the side condition
that the maximum bias is bounded by some bias bound. A one-way
classification model is given to illustrate her L1 test statistics.

Part two is on computational procedures, algorithms and computer packages.
José Agullo proposes two new exact algorithms for the least median of
squares, and Mia Hubert and Peter Rousseeuw put forward new developments
in their computer package PROGRESS. Bill Farebrother outlines the early
history of traditional estimation of observational equations from the
third century to Roger Boscovich, passing through the Tobias Mayer method.
Steve Portnoy explains his new findings on the computation of L1 methods
and shows that his method of computing is faster than computing least
squares for large data sets with n = 10* observations. Steve together with
Roger achieved their goal by replacing the simplex approach with interior
point methods based on a stochastic preprocessing step that begins with a
much smaller random subset of the data. I am sure that with this
breakthrough algorithm the area of L1 statistical data analysis will take
a new turn in the future. Chris Adcock, Meade, Chepoi, Cogneau, Bernard
Fichet and Fitzenberger add their new findings to this part and provide
directions for future developments.

Statistical graphics are grouped together in Part three. Efstathia Bura,
Simon Sheather and Joseph McKean present new ideas with the application
of inverse regression for dimension reduction, and Francesca Chiaromonte
provides the foundation and philosophical reasoning behind a reduction
paradigm. Elias Moreno proposes a Bayesian method of model selection.
Wetzel describes key features of an interactive method that constructs
CERES plots, a new class of plots that includes the partial residual plots
for identification of curvature in regression models.

Time series analysis and financial statistics are the subject of Part four of 
this volume. Five articles by Gib Bassett, Hurst and Platen, Keith Knight, 
Hans Nyquist and Terui and Kariya discuss L1 and L2 methods for the
analysis of time and money! 

Bruce Brown, Tom Hettmansperger, Möttönen and Oja present the rank plot
in the affine invariant case in Part five, and Fernholz, Stute and
Una-Alvarez, Gonzalez-Manteiga and Cadarso-Suarez contribute to this
nonparametric part with new ideas in target estimation, model checking and
density estimation.

Part six contains three contributions on multivariate analysis. Biman 
Chakraborty and Probal Chaudhuri study an extension of rank regression 
techniques to multivariate models. Philip Milasevic and Deb Nolan present 
an original paper on mode and concentration estimation in multidimen- 
sions. Nolan dedicated her joint work to the memory of her co-author 
Milasevic. By means of empirical process theory, Deb and Milasevic pro- 
vide the rate of convergence and limiting distributions of their estimators. 
Fraiman, Regina Liu and Jean Meloche study the estimation of multivariate
density by probing depth. They consider a class of multivariate densities
within which a density function f can be expressed as f = g ∘ D for
some notion of data depth D and some real function g.

In Part seven, three papers by Anil Chaturvedi and Doug Carroll; Laurence
Hubert, Phipps Arabie and Jacqueline Meulman; and Boris Mirkin discuss
different approaches with the L1-norm to classification problems.

Fourteen of the fifty-two invited papers appear at the end of the volume in
abstract form as they are published elsewhere.

If a volume or a manuscript has a prologue it should, in principle, have 
an epilogue. The volume is about L1 methods in all areas of statistics. As
Roger Koenker stated in his paper in the prologue section: “Despite the 
best efforts of such distinguished advocates as Laplace (1789), Edgeworth 
(1888), and Kolmogorov (1931), methods of estimation based on minimizing 
sums of absolute errors have languished in the shade of the edifice that 
Gauss built on the foundation of least squares. Why?”. The volume would 
not have been complete without some prognosis for the future of robust 
methodologies. This is why I took the liberty to call Peter Huber and 
asked his permission to include his paper Robustness: Where are we
now? as an epilogue. I had remembered the story:


Devil: Master, Master, we are in serious trouble. 
On the planet Earth, Man has discovered the truth. 
Master: Don’t worry, truth will soon become dogma. 

Organising and hosting a conference, even at this modest level, requires
a lot of courage, patience, and above all a responsible secretary. I was very
fortunate to have a very dedicated one. I wish to express my gratitude 
to Melanie Miserez for handling with extreme care all the correspondence. 
I also owe a great debt of gratitude to Gérard Geiser for his outstanding 
TEX skill in the production of the volume and to Valentin Rousson who 
generously worked along with Gérard to make the production faster. 

I wish to acknowledge the generous support of the Swiss National Science 
Foundation (Grant No 2101-49’913.96), the Swiss Academy of Humanities 
and Social Sciences and the University of Neuchatel. Without the financial 
aid of these agencies the conference could not have been held. 

I am grateful to all those who participated in the conference, and to the
organizers of the invited sessions, Dennis Cook, Willem Heiser, Jana
Jureckova, Regina Liu, Joe McKean, Stephan Morgenthaler, Hans Nyquist,
Wolfgang Polasek, Steve Portnoy, Peter Rousseeuw, Gabriela Stangenhaus,
Takeaki Kariya, Maurizio Vichi, Jinde Wang, and Joe Whittaker, and
certainly to the invited persons whose contributions made this volume
possible.

Finally, on the other side of the Atlantic, David Ruppert, James Sanders 
and Patti Shankland, did not stop their encouragement. Having the volume 
ready in time represents a considerable investment of their efforts and I do 
thank them very sincerely. 

It takes courage, good staff and a great number of colleagues to organize 
a conference. I certainly would not recommend this burden to anyone but
the insane. 


University of Neuchatel Yadolah Dodge 
Switzerland Editor 
August 1997 
Introduction


Yadolah Dodge 


University of Neuchatel, Switzerland 


While the method of least squares (and its generalizations) has served
statisticians well for a good many years (mainly because of mathematical
convenience and ease of computation), and enjoys certain well-known
properties within strictly Gaussian parametric models, it is recognized
that outliers, which arise from heavy-tailed distributions, have an
unusually large influence on the estimates obtained by these methods.
Indeed, one single outlier can have an arbitrarily large effect on the
estimate. Outlier diagnostics have been developed to detect observations
with a large influence on the least squares estimation. For excellent
books related to such diagnostics the reader is referred to Cook and
Weisberg (1982, 1994) and Chatterjee and Hadi (1988).

Parallel to diagnostic techniques, robust methods with varying degrees 
of robustness and computational complexity have been developed to modify 
the LS method so that the outliers have less influence on the final estimates. 
Among others are the bounded influence estimators, the repeated median, 
the least median of squares and the regression quantile methods. 

In 1964, Huber published what is now considered to be a classic paper
on robust estimation of a location parameter, which was subsequently
extended to the linear model. The development of selected robustness
concepts since their inception in the 1960’s, and their current status,
is given by Huber (1995).

One of the simplest robust alternatives to LS is the least absolute value
method. This method, which is the subject of this volume, is widely
recognized as a superior method especially well suited to longer-tailed
error distributions, such as the Laplace distribution.

Depending on the field of application, the least absolute value method
has been studied in several contexts under a variety of names such as
minimum, or least sums of absolute errors, deviations or values; and here
we refer to it as the L1-norm method (for minimizing the L1-norm of the
vector of deviations). The L1 method estimates the unknown parameters
in a stochastic model so as to minimize the sum of the absolute deviations
of a given set of observations from the values predicted by the model.

Historically, L1 estimation is the oldest of all robust methods. The
method of least absolute deviations was introduced almost 50 years before
the method of least squares, in 1757 by Roger Joseph Boscovich
(1711-1787). He devised the method as a way to reconcile inconsistent
measurements for estimating the shape of the earth. After Pierre Simon
Laplace adopted the method 30 years later, it saw occasional use but was
soon overshadowed by the method of least squares. The popularity of least
squares was at least partly due to the relative simplicity of its
computations and to the supporting theory that was developed by Gauss and
Laplace. Laplace, in his second memoir on the Figure of the Earth in 1789,
adopted Boscovich’s two criteria for a line of best fit, and gave an
algebraic formulation and derivation of Boscovich’s algorithm.

Nearly seventy years after the publication of Laplace’s second
supplement to the Théorie Analytique des Probabilités (1818), Edgeworth
(1887) presented a method for linear regression using the L1 method. But
since the publication of Edgeworth’s work, few attempts have been made
to convince statisticians, and particularly applied users, to employ
this method (see Turner, 1887; Rhodes, 1930; Singleton, 1940; Karst, 1958).
Reasons for such a long silence may be summarized as follows:

(1) Computational difficulties in producing the numeric values of the L1
estimates in regression (lack of closed form formulae similar to those of
least squares).

(2) Lack of an asymptotic theory for L1 estimation in the regression
model, and more generally the nonexistence of accompanying statistical
inference procedures.

(3) Insufficient evidence to show the superiority of the small sample
properties of L1 estimation compared to the LS estimators when sampling
from long tailed distributions.

Following the work of Charnes, Cooper and Ferguson (1955), a renewed
interest in using L1 estimation for regression problems was created. They
showed the equivalence between the L1 problem and a linear programming
problem. Wagner (1959) suggested that the L1 problem in a linear
regression of the form
$y_i = \theta_0 + \theta_1 x_{1i} + \cdots + \theta_p x_{pi} + \varepsilon_i$,
or in matrix form $Y = X\theta + \varepsilon$, can be solved by solving
the dual of the L1 problem. He also observed that the dual problem can be
reduced to a problem with a smaller basis but where the dual variables
have upper-bound restrictions. Wagner’s formulation of the problem is to
restate the problem of minimizing $\sum_i |\varepsilon_i|$ with respect to
$\theta$, where $\varepsilon_i$ is the deviation between the observed and
predicted values of the $i$th observation $Y_i$, as: minimize
$\sum_i |\varepsilon_i|$, subject to $X\theta + \varepsilon = Y$, where
$\theta$ and $\varepsilon$ are unrestricted in sign.

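Noting the fact that $|\varepsilon_i| = \varepsilon_{1i} + \varepsilon_{2i}$,
where $\varepsilon_{1i} = \varepsilon_i$ if $\varepsilon_i > 0$, 0
otherwise, and $\varepsilon_{2i} = -\varepsilon_i$ if $\varepsilon_i < 0$,
0 otherwise, that both are nonnegative and
$\varepsilon_i = \varepsilon_{1i} - \varepsilon_{2i}$, we can reformulate
the problem as a linear programming problem: minimize
$\sum_i \varepsilon_{1i} + \sum_i \varepsilon_{2i}$, subject to
$X\theta + \varepsilon_1 - \varepsilon_2 = Y$, where $\theta$ is
unrestricted in sign and $\varepsilon_1, \varepsilon_2 \geq 0$.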
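
As a small illustration of the criterion being minimized, the following
Python sketch fits a straight line by least absolute deviations. It relies
on the linear programming property that an optimum is attained at a vertex,
which for a two-parameter line means that some optimal fit passes through
two of the data points; the sketch simply tries every such pair. This
brute-force search is an assumption made for brevity and is not the simplex
or dual approach of the authors cited here; the function names and the data
are hypothetical.

```python
from itertools import combinations


def l1_loss(a, b, xs, ys):
    """Sum of absolute deviations of the line y = a + b*x."""
    return sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))


def l1_line_fit(xs, ys):
    """Brute-force L1 fit of y = a + b*x (illustrative only, O(n^3)).

    Tries every line through two data points with distinct x values and
    keeps the one with the smallest sum of absolute deviations.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # vertical line, not of the form y = a + b*x
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        loss = l1_loss(a, b, xs, ys)
        if best is None or loss < best[2]:
            best = (a, b, loss)
    if best is None:
        raise ValueError("need at least two points with distinct x values")
    return best


# Hypothetical data: four points on y = 1 + 2x and one gross outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 100.0]
a, b, loss = l1_line_fit(xs, ys)
print(a, b, loss)
```

On these data the fitted line follows the four collinear points and leaves
the whole discrepancy on the outlier, whereas a least squares line would be
pulled toward it.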

From the computational point of view, the L1 method is now extremely
simple and it requires only a routine to fit the L1 regression. There are
several computer programs available for calculation of L1 estimates. See
for example Sadovski (1974) and Farebrother (1988). For the case of
multiple regression we can use, for example, the modified simplex
algorithm of Barrodale and Roberts (1973) that exists in the IMSL library
under the name RLLAV. The L1 estimation problem with additional linear
restrictions (restricted L1 problem) is considered along the same lines in
Barrodale and Roberts (1974). Arthanari and Dodge (1993) devoted a
complete chapter to computational aspects of L1 estimation. L1
regression estimates are also obtainable from the function l1fit in the
computer language S-Plus and from the robust regression package ROBSYS
(Marazzi, 1993). Detection of outlying points in both dependent and
independent variables in regression models is explained in Dodge (1997).

The major difficulty for applied researchers in using L1 estimation for
many years was the lack of accompanying statistical inference procedures.
Such procedures would include methods for testing general linear
hypotheses, obtaining confidence intervals, constructing analysis of
variance tables and performing multiple comparison procedures.

Bassett and Koenker (1978) developed the asymptotic theory for L1
estimators in the regression model. Their finding is considered to be a
breakthrough for the problem. Their main result is that the sampling
distribution of L1 estimators will be asymptotically normal with a
specified mean and variance. Under very general assumptions they confirmed
that the L1 estimator $\hat{\theta}$ has an asymptotically normal
distribution with mean $\theta$ and covariance matrix
$\tau^2 (X'X)^{-1}$, where $\tau^2/n$ is the asymptotic variance of the
sample median from random samples of size n taken from the error
distribution with a continuous and positive derivative at the median. This
result is remarkably similar to that for LS. Therefore the L1 confidence
interval for an estimable function $\lambda'\theta$ is


$$\lambda'\hat{\theta} \pm z_{\alpha/2}\, \hat{\tau} \sqrt{\lambda'(X'X)^{-}\lambda}$$


where $(X'X)^{-}$ is the g-inverse of $X'X$ and $\hat{\tau}$ is an
estimate of $\tau$ given in McKean and Schrader (1987).

Koenker and Bassett (1982) investigated the asymptotic distribution of
three alternative L1 test statistics of a linear hypothesis in the standard
linear model. They showed that the three test statistics, which correspond
to the Wald, likelihood ratio and Lagrange multiplier tests, under mild
regularity conditions on the design and error distribution, have the same
limiting chi-square behavior. For a complete treatment of L1 regression the
reader is referred to Chapter 4 of Birkes and Dodge (1993).

With the availability of many computationally efficient algorithms and 
developed inference procedures for testing general linear hypotheses, for 
obtaining confidence intervals, selection of variables, analysis of variance 
tables and multiple comparison, it is hoped that L1 estimation methods
will be employed more often by researchers in applied sciences than before. 

Certainly, there are many other areas of statistical data analysis based
on the L1-norm (such as density estimation, time series analysis,
multivariate analysis and classification methods) that could have been
discussed here. But, unfortunately, limitations of space and time have
caused many interesting and important lines of research to be treated
lightly or not at all. While it is now evident that no single robust
procedure is best by all criteria, it may be appropriate (or at least
reasonable) to use adaptive convex combinations of L1 with other methods
rather than a single criterion to estimate the unknown parameters.
However, for the error distributions for which the median is superior to
the mean as an estimator of location, L1 estimation is certainly preferred
to least squares and strongly recommended for use in these cases.

Bloomfield and Steiger (1983), Devroye and Györfi (1985), and Gonin 
and Money (1989) are the only books entirely devoted to the L1 topic. The
authors of these three texts had the courage to pull together a rich and 
diverse literature in this field. I hope that the proceedings contained in this 
volume and its predecessors, Dodge (1987, 1992), will encourage someone 
to write a fourth. 
Abstracts


Asymptotic properties of L1 estimators in a multi-stage dose-response
model: A Monte Carlo study


Carmen D.S. André, University of São Paulo, Brazil 
Subhash C. Narula, Virginia Commonwealth University, Richmond, USA 


Abstract: The single-stage and two-stage dose response models are fre- 
quently used in practical applications. The maximum likelihood and the 
least squares principles are often used to estimate the unknown parameters 
of the model. It has been shown that these methods are sensitive to outliers 
in the data. The minimum sum of absolute errors MSAE (or L1) criterion
is more resistant to outliers than these popular procedures. However, at 
present not much is known about the statistical properties of the MSAE 
estimators of the parameters of the multistage dose-response model. In 
this paper, our objective is to study asymptotic properties and distribution 
of the MSAE estimators of the single-stage and two-stage dose-response
models by simulation and to find the smallest sample size for which we 
may use the asymptotic distribution to draw statistical inferences about 
the parameters. We also give an approximate expression for the variance of 
these estimators when their asymptotic distribution follows a multinormal 
distribution. 


The L1-norm and interlaboratory tests


P. L. Davies, University of Essen, Germany 


Abstract: The form of interlaboratory test we consider is that where
each of I laboratories returns exactly one reading for each of J samples.
Such a test may be described by the random effects model


$$X_{ij} = \mu + \lambda_i + a_j + \varepsilon_{ij}, \quad 1 \le i \le I, \; 1 \le j \le J.$$


The $\lambda_i$ represent the laboratory effects, the $a_j$ the sample
contaminations and the $\varepsilon_{ij}$ the measurement errors. The
problem is to identify outlying observations and outlying laboratories. As
we have only one observation per cell it is commonly believed that it is
not possible to detect outliers or, equivalently, non-additivity. As shown
in Terbeck and Davies (1996) this is not correct and so-called
unconditionally identifiable outlier patterns may be found by the L1- or
an appropriate M-functional. The results of Terbeck and Davies are
improved in certain respects and then applied to the random effects model.
The method is applied to a real data set considered by Lischer (1993)
which is concerned with the determination of lead in sewage sludge.


Multivariate L1 mean


Yadolah Dodge and Valentin Rousson, 
University of Neuchatel, Switzerland 


Abstract: The center of a univariate data set $\{x_1, \ldots, x_n\}$ can be
defined as the point $\mu$ that minimizes the norm of the vector of
distances $y = (|x_1 - \mu|, \ldots, |x_n - \mu|)$. As the median and the
mean are the minimizers of respectively the L1- and the L2-norm of $y$,
they are two alternatives to describe the center of a univariate data set.
The center $\mu$ of a multivariate data set $\{x_1, \ldots, x_n\}$ can
also be defined as the minimizer of the norm of a vector of distances. In
multivariate situations however, there are several kinds of distances. In
this paper, we consider the vector of L1-distances
$y_1 = (\|x_1 - \mu\|_1, \ldots, \|x_n - \mu\|_1)$ and the vector of
L2-distances $y_2 = (\|x_1 - \mu\|_2, \ldots, \|x_n - \mu\|_2)$. We define
the L1-median and the L1-mean as the minimizers of respectively the L1-
and the L2-norm of $y_1$; and then the L2-median and the L2-mean as the
minimizers of respectively the L1- and the L2-norm of $y_2$. In doing so,
we obtain four alternatives to describe the center of a multivariate data
set. While three of them have been already investigated in the statistical
literature, the L1-mean appears to be a new concept. Contrary to the
L1-median, the L1-mean is proved to be unique in almost all situations. In
order to compare these multivariate medians and means, we use the rule of
the net advantage coefficient introduced by Stavig and Gibbons (1977). A
simulation study shows that the L1-mean performs well, especially for data
sets drawn from the bivariate Laplace distribution.
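
The univariate statement in this abstract can be checked numerically. The
sketch below is a hedged illustration and not part of the authors' paper:
it evaluates the L1- and L2-norm of the vector of distances over a grid of
candidate centers and compares the minimizers with the sample median and
mean. The data, the grid and the helper names are all assumptions chosen
for illustration.

```python
import math
import statistics


def l1_norm_of_distances(mu, data):
    """L1-norm of the vector (|x_1 - mu|, ..., |x_n - mu|)."""
    return sum(abs(x - mu) for x in data)


def l2_norm_of_distances(mu, data):
    """L2-norm of the same vector of distances."""
    return math.sqrt(sum((x - mu) ** 2 for x in data))


# Hypothetical data with one large value; odd n so the median is unique.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

# Candidate centers 0.0, 0.5, ..., 100.0 (an arbitrary illustrative grid).
grid = [k * 0.5 for k in range(201)]

l1_center = min(grid, key=lambda mu: l1_norm_of_distances(mu, data))
l2_center = min(grid, key=lambda mu: l2_norm_of_distances(mu, data))

print(l1_center, statistics.median(data))  # grid minimizer vs. sample median
print(l2_center, statistics.mean(data))    # grid minimizer vs. sample mean
```

The single large observation moves the L2 center far from the bulk of the
data, while the L1 center stays with the middle observation.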


The information for the direction of dependence in 
L1 regression


Yadolah Dodge, University of Neuchatel, Switzerland 
Joe Whittaker, Lancaster University, U.K. 


Abstract: An L1 regression model for a response variable $X_2$ is to
suppose that the conditional distribution of $X_2$ given $X_1$ is Laplace,
and that the marginal distribution of the explanatory variable $X_1$ is
also Laplace. We show that there is information to distinguish the
direction of dependence between $X_1$ and $X_2$; or equivalently to
distinguish between the models in which $X_1$ is dependent on $X_2$, and
$X_2$ is dependent on $X_1$. This is not true for L2 regression based on
the Normal distribution.


Dimension choice for sliced inverse regression based on ranks 


Louis Ferré, Université Paul Sabatier, Toulouse, France 


Abstract: Sliced Inverse Regression is a method for reducing the
dimensionality in multivariate nonparametric regression problems. While
the selection of the dimensionality has been investigated for the original
version, no solution has been proposed for the Hsing and Carroll (1992)
approach based on order statistics and associated concomitant variables. By
using model selection approaches, we propose here two ways of selecting
the dimensionality by estimating a loss function: first, a direct
estimation is proposed and then a jackknifed estimate is investigated.
Finally, the rank version is compared to classical SIR on a real life data
set.


L1-norm and L -norm methodology in cluster
analysis 


Allan D. Gordon, University of St Andrews, North Haugh, Scotland 


Abstract: An overview is provided of the use of L1-norm and Lo-norm
methodology in cluster analysis. Topics covered include dissimilarity
measures, partitions, fuzzy classifications, hierarchical classifications,
and consensus classifications.


Some issues in the applications of conditional 
quantile functions 


Xuming He, University of Illinois, USA 


Abstract: Conditional quantile functions are useful in a variety of appli- 
cations. Regression quantiles for linear models have been recently extended 
to semiparametric and nonparametric models. Further investigations are 
needed for both the statistical theory and computations. In this paper, 
I attempt to raise two questions that I believe are important to build a 
solid foundation for the applications of quantile regression. They focus on
nearly extreme quantiles and the problem of crossing in estimated quantile 
functions. 


The median function on structured metric spaces 


F. R. McMorris, University of Louisville, USA 


Abstract: When $(X, d)$ is a finite metric space and
$\pi = (x_1, \ldots, x_k) \in X^k$, a median for $\pi$ is an element $x$
of $X$ for which $\sum_{i=1}^{k} d(x, x_i)$ is minimum. The function that
returns the set of all medians for any tuple $\pi$ is called the median
function on X. A brief survey is given of some of the results concerning
the median function, starting with an arbitrary metric space and finishing
with the case where X is a set of hypergraphs and d is the metric based on
the L1-norm. A simplistic maximum likelihood interpretation for the median
function is also given.
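
The definition above translates directly into a few lines of code. The
following Python sketch is only an illustration under assumed inputs: the
function name `median_set`, the small space X = {0, ..., 4} and the metric
d(x, y) = |x - y| are hypothetical choices, not taken from the paper.

```python
def median_set(space, metric, profile):
    """Return all elements of `space` minimizing the sum of distances
    to the entries of `profile` (the median function of the abstract)."""
    totals = {x: sum(metric(x, xi) for xi in profile) for x in space}
    smallest = min(totals.values())
    return {x for x, total in totals.items() if total == smallest}


# Hypothetical finite metric space: points 0..4 with d(x, y) = |x - y|.
space = range(5)


def metric(x, y):
    return abs(x - y)


odd_profile = (0, 1, 4)       # tuple of length k = 3
even_profile = (0, 1, 4, 4)   # tuple of length k = 4

print(median_set(space, metric, odd_profile))
print(median_set(space, metric, even_profile))
```

A median need not be unique, which is why the sketch returns a set rather
than a single element.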


Least absolute value estimation of a linear 
functional model 


Edina Shisue Miazaki, Statistika Consultoria, Campinas, Brazil 
Gabriela Stangenhaus, Universidade de Brasília, Brazil 


Abstract: This paper presents two robust L1 based estimators for the
parameters of a simple linear functional relationship (SLFR). The maximum
likelihood estimation when the errors follow a double exponential
distribution and the weighted L1 estimation are solved as non-linear
optimization problems. The least median of squares estimates are proposed
as starting values and the scale measures of the errors are based on the
MAD. Both methods are resistant to outlying observations and the weighted
L1 estimator is resistant to leverage points. Real examples illustrate the
methods.


ANOVA - models: A Bayesian analysis


Wolfgang Polasek and Shuangzhe Liu, University of Basel, Switzerland


Abstract: The 1-way and 2-way ANOVA are formulated as Bayesian lin- 
ear models with conjugate prior distributions. The classical case is treated 
as a special one using matrix generalized (g-) inverses leading to so-called
OLS~ and OLS* estimates of the rank deficient ANOVA model. The 2- 
way ANOVA model without interactions can also be estimated in a 2-step 
procedure. 


Robustifying growth curve model estimation 


Gabriela Stangenhaus, Statistika Consultoria, Campinas, Brazil 
Elisete C. Quintaneiro Aubin, Universidade de São Paulo, Brazil 


Abstract: A robustified version of the parameter matrix estimators in 
the standard growth curve model obtained via the Potthoff-Roy transfor- 
mation is presented. The asymptotic distribution of the robust estimators is 
derived and the estimation of their variance-covariance matrix is discussed. 


Fitting L2 norm classification models to complex data sets


Maurizio Vichi, University “G. D’Annunzio” di Chieti, Pescara, Italy


Abstract: In this paper methodologies for fitting classification models 
(dendrograms and partitions) to two and three-way arrays of dissimilarities 
minimizing an L2 norm loss function are examined. A new algorithm for
fitting several hierarchical classifications to quite large three-way arrays is 
also discussed. 


Applications of mathematical programming in 
L1-estimation of nonlinear models


Jinde Wang, Nanjing University, China 


Abstract: In this paper we review the results in L1-estimation of
nonlinear models obtained by applying mathematical programming techniques.
We describe briefly the ways to find the asymptotic distribution and the
approximate representation, and to treat dependent random error cases and
inequality-constrained cases. With these results one can conclude that
mathematical programming is a suitable tool for studying L1-estimation
problems.
Some contributions to M-estimation in regression models


Lincheng Zhao, 
University of Science and Technology of China, Hefei, China 


Abstract: In this paper, we briefly survey some contributions to
asymptotic theory on M-estimation in a linear model and least absolute
deviations (LAD) estimation in a censored regression model (known as the
Tobit model), as well as on the relevant test criteria in ANOVA in the
above models. As a general approach to statistical data analysis,
asymptotic theory of M-estimation in regression models has received
extensive attention. In recent years, the author and some of his
collaborators worked in this field and obtained some new results. In this
paper we briefly introduce some of them and the related work in the
literature. As a special case, the minimum L1-norm (MLN) estimation, also
known as the least absolute deviations (LAD) estimation, plays an
important role and is of special interest. Considering this point, we will
pay much attention to it as well. Our topics concern the usual linear
model and a censored regression model, known as the Tobit model in
econometric research.


Contributors 


Adcock Christopher, School of Economic and Business Studies, Univ. of 
Westminster, 309 Regent Street, W1R 8AL London, U.K.


Agullo Jose, Dept. Fundamentos del Analisis Economico, Universidad 
de Alicante, 03080 Alicante, SPAIN 


Andre D.S. Carmen, Instituto de Matematica e Estatistica da USP, Rua 
do Matao 1010, C.P. 66281 - Agencia Cidade de Sao Paulo, CEP 05315-970
Sao Paulo-SP, BRAZIL 


Arabie Phipps, Graduate School of Management, Rutgers University, 
92, New Street, NJ 07102-1895 Newark, USA 


Aubin Elisete, Instituto de Matematica e Estatistica da USP, Rua do 
Matao, 1010, CEP, 05508-900 Sao Paulo, SP, BRAZIL 


Bassett Gilbert W., Dept. of Economics, Univ. Illinois-Chicago, 601 S. 
Morgan St, Rm 2103, 60607-7121 Chicago/ Illinois, USA 


Brown Bruce M., University of Tasmania-Math, P.O. Box 252 C, 7001
Hobart Tasmania, AUSTRALIA 


Bura Efstathia, Dept. of Statistics, George Washington Univ., DC 20052 
Washington, USA 


Cadarso-Suarez Carmen, Dept de Estadistica e I.O., Seccion de Bioes- 
tadistica, Facultad de Medicina, Universidad de Santiago, La Coruna San- 
tiago de Compostela, SPAIN 


Carroll Douglas, Faculty of Management/ Marketing, Rutgers Univ., 81 
New St, NJ 07102-1895 Newark, USA 


Chakraborty Biman, Theoretical Statistics & Mathematics Unit, Indian 
Statistical Institute, 203 B.T. Road, 700035 Calcutta, INDIA 


Chaturvedi Anil, Technology Consultant, AT&T Labs, Room 5C-134, 
600 Mountain Avenue, NJ 07974 Murray Hill, USA 


Chaudhuri Probal, Theoretical Stats & Math., Indian Stat. Inst., 8 Ramlal 
Agarwalla Lane, 700050 Calcutta, INDIA 


Chepoi Victor, Laboratoire de Biomathématiques, Faculté de Médecine, 
Université d’Aix Marseille II, 27 Bd Jean Moulin, 13385 Marseille Cedex 
5, FRANCE 


Chiaromonte Francesca, International Institute for Applied Systems 
Analysis, A-2361 Laxenburg, AUSTRIA 


Cogneau Daniel, Faculté de Médecine, Laboratoire de Biomathématique, 
27, Bd. Jean Moulin, 13385 Marseille, FRANCE 


Davies Laurie, Univ. Gesamthochschule Essen, FB6 Maths, Universi- 
tatstrasse 3, 4300 Essen 1, GERMANY 


de Una Alvarez Jacobo, Depto de Estadistica e I.O., Campus Lagoas- 
Marcosende, Universidad de Vigo, 86200 Vigo, SPAIN 


Dodge Yadolah, Groupe de Statistique, Université de Neuchâtel, Pierre- 
à-Mazel 7, 2000 Neuchâtel, SWITZERLAND 


Farebrother Richard William, 11 Castle Road, Bayston Hill, Shrewsbury, 
SY3 0NF Shropshire, U.K. 


Fernholz Luisa, Statistics Dept., Temple Univ., Speakman Hall, PA 
19122 Philadelphia, USA 


Ferré Louis, Laboratoire de Statistique, Université Paul Sabatier, 118 
route de Narbonne, 31062 Toulouse Cedex, FRANCE 


Fichet Bernard, Faculté de Médecine, Laboratoire de Biomathématique, 
27, Bd. Jean Moulin, 13385 Marseille, FRANCE 


Fitzenberger Bernd, Department of Economics and Statistics, University 
of Konstanz, P.O. Box 5590 <D 139>, D-78434 Konstanz, GERMANY 


Fraiman Ricardo, Centro de Matematica, Universidad de la Republica- 
Uruguay, Eduardo Acevedo 1139, Montevideo, URUGUAY 


Gordon Allan D., Mathematical Institute, University of St Andrews, 
North Haugh, St Andrews KY16 9SS, Fife, SCOTLAND 


Gonzalez-Manteiga Wenceslao, Departamento de Estadistica, Facultad 
de Matematicas, Universidad de Santiago de Compostela, La Coruna, SPAIN 


Gutenbrunner Cornelius, Uni-Klinikum, Abteilung für Kinder und Ju- 
gendpsychiatrie, Hans-Sachs Strasse 6, 35033 Marburg, GERMANY 


Hall Peter, Dept. of Statistics, A.N.U., G.P.O. Box 4, A.C.T. 2601 
Canberra, AUSTRALIA 


Hallin Marc, Institut de Statistique, Campus de la Plaine, Univ. Libre 
de Bruxelles, C.P. 210, B-1050 Bruxelles, BELGIUM 


He Xuming, Dept. of Statistics, Univ. of Illinois-Champaign, 725 S 
Wright St, IL 61820-5714 Champaign, USA 




Hettmansperger Tom, Pennsylvania State Univ., Dept. of Statistics, 317 
Classroom Building, PA 16802-2111 University Park, USA 


Huber, Peter J., LSt. Mathematik, Universität Bayreuth, Universität- 
strasse 30, 95440 Bayreuth, GERMANY 


Hubert Lawrence, Dept. of Psychology (and Dept. of Statistics), Uni- 
versity of Illinois-Champaign, 603 East Daniel Street, IL 61820-6232 Cham- 
paign, USA 

Hubert Mia, Dept of Mathematics, Universitaire Inst. Antwerpen, Uni- 
versiteitsplein 1, 2610 Wilrijk, BELGIUM 


Hurst Simon R., Centre for Financial Mathematics, School of Mathe- 
matical Sciences, Australian National Univ., ACT 0200 Canberra, AUS- 
TRALIA 


Huskova Marie, Charles Univ., Dept. of Probability and Stat., Sokolovska 
83, 18600 Prague 8, CZECH REPUBLIC 


Jureckova Jana, Charles Univ., Dept. of Probability and Stat., Sokolovska 
83, 18600 Prague 8, CZECH REPUBLIC 


Kariya Takeaki, Institute of Economic Research, Hitotsubashi Univ. Ku- 
nitachi, 186 Tokyo, JAPAN 


Klebanov L.B., Institute of Mathematical Geology, St Petersburg, RUS- 
SIA 


Knight Keith, Dept. of Statistics, Univ. of Toronto, 100 St-George 
Street, M5S 1A1 Toronto, CANADA 


Koenker Roger, Dept. of Economics, Univ. of Illinois, 1206 S 6th St, IL 
61820-6915 Champaign, USA 


Liu Regina, Dept. of Statistics-Hill Cntr, Rutgers University, NJ 08855 
Piscataway, USA 


Liu Shuangzhe, Institute of Statistics and Econometrics, Holbeinstrasse 
12, 4051 Basel, SWITZERLAND 


McMorris F.R., Univ. of Louisville, Dept. of Mathematics, KY 40292 
Louisville, USA 


McKean Joe, Dept. of Math & Statistics, Western Michigan Univ., MI 
49008 Kalamazoo, USA 


Meade Nigel, The Management School, Imperial College, 53 Princes 
Gate, SW7 2PG London, U.K. 




Meloche Jean, Dept. of Statistics, 2021 West Mall, Univ. of British 
Columbia, BC V6T 1W5 Vancouver, CANADA 


Meulman Jacqueline J., Department of Data Theory, Leiden University, 
P.O. Box 9555, 2300 RB Leiden, THE NETHERLANDS 


Miazaki Edina Shisue, C.P. 04393, 70919-970 Brasilia, D.F., BRAZIL 


Mirkin Boris G., Division of Theoretical Bioinformatics (Abt 0815), Ger- 
man Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 
Heidelberg, GERMANY 


Mizera Ivan, Department of Probability and Statistics, Comenius Uni- 
versity, Mlynska dolina, 84215 Bratislava, SLOVAKIA 


Moreno Elias, Department of Statistics, Faculty of Science, University 
of Granada, 18071 Granada, SPAIN 


Morgenthaler Stephan, EPFL, Dept. of Mathematics, 1015 Lausanne, 
SWITZERLAND 


Mottonen Jyrki, Department of Mathematical Sciences, University of 
Oulu, 90570 Oulu, FINLAND 


Muller Christine, Freie Universität Berlin, Fachbereich Math. & Informatik, 
WE1, Arnimallee 2-6, 14195 Berlin, GERMANY 


Narula Subash C., School of Business, Virginia Commonwealth Univ., 
1015 Floyd Avenue, VA 23284-4000 Richmond, USA 


Nolan Deborah A., Dept. of Statistics, 367 Evans Hall, Univ. of Cali- 
fornia, CA 94720-3860 Berkeley, USA 


Nyquist Hans, Swedish Univ. of Agriculture, Dept. of Biometry, 901 83 
Umea, SWEDEN 


Oja Hannu, Dept. of Applied Maths and Statistics, Linnanmaa, Univ. 
Oulu, 90570 Oulu 57, FINLAND 


Platen Eckhard, Centre for Financial Mathematics, School of Mathe- 
matical Sciences, The Australia National University, 0200 Canberra ACT, 
AUSTRALIA 


Polasek Wolfgang, Institute of Statistics and Econometrics, Holbein- 
strasse 12, 4051 Basel, SWITZERLAND 


Portnoy Steve, Dept. of Statistics, Univ. of Illinois-Champaign, 725 S 
Wright St, IL 61820-5714 Champaign, USA 


Raimondo Marc, Australian National University, Department of Statistics, 
G.P.O. Box 4, A.C.T. 2601 Canberra, AUSTRALIA 


Rousseeuw Peter, Dept. of Mathematics, Universitaire Inst. Antwerpen, 
Universiteitsplein 1, 2610 Wilrijk, BELGIUM 


Rousson Valentin, Groupe de Statistique, Université de Neuchâtel, Pierre- 
à-Mazel 7, 2000 Neuchâtel, SWITZERLAND 


Sheather Simon, Australian Graduate School of Management, University 
of New South Wales, 2033 Kensington NSW, AUSTRALIA 


Stangenhaus Gabriela, Rua Americo de Campos 133, 13083-040 Camp- 
inas SP, BRAZIL 


Stute Winfried, Math. Institut, Univ. Giessen, Arndtstr. 2, 35392 
Giessen, GERMANY 


Terui Nobuhiko, Tinbergen Institute, Rotterdam, Burg Oudlaan 50, PA 
3062 Rotterdam, THE NETHERLANDS 


Vichi Maurizio, Università “G. D’Annunzio” di Chieti, Dept. of Metodi 
Quantitativi e Teoria Economica, Viale Pindaro 42, 65127 Pescara, ITALY 


Wang Jinde, Dept. of Mathematics, Nanjing Univ., 22, Hankou Road, 
210008 Nanjing, CHINA 


Wetzel Nate, Dept. of Mathematical Sciences, Binghamton Univ., NY 
13902-6000 Binghamton, USA 


Whittaker Joe, Dept of Mathematics, Lancaster University, LA1 4YF 
Lancaster, U.K. 


Zhao Lincheng, Department of Mathematics, Univ. of Science & Technology, 
HEFEI, 230026 Anhui, CHINA 


Zwanzig Silvelyn, Institut für Mathematische Stochastik, Universität 
Hamburg, Bundesstrasse 55, 20146 Hamburg, GERMANY 


PROLOGUE 


Me: Which way must I follow? 

The Sage: If you are a real pilgrim, you will achieve 
the journey whichever way you go. 

Sohravardi 
L1-Statistical Procedures and Related Topics 
IMS Lecture Notes - Monograph Series (1997) Volume 31 


Measuring the performance of 
boundary-estimation methods 


Peter Hall 


Australian National University, Canberra, Australia and 
CSIRO Mathematical and Information Sciences, Sydney, Australia 


Marc Raimondo 


Australian National University, Canberra, Australia 


Abstract: The problem of local linear approximation to a curved bound- 
ary using gridded data is closely connected to both curve estimation 
methods in statistics and rational approximation in number theory. The 
problem is ill-posed, in the sense that orders of approximation at arbi- 
trarily close points can be very different. This may be interpreted as a 
consequence of the problem’s number-theoretic aspects, since irrational 
numbers with arbitrarily slowly convergent rational approximations are 
distributed in dense sets. On the other hand, by measuring performance 
in a “statistically average” way which excludes most of the pathologies, 
we may deduce useful results about optimal orders of approximation. In 
this respect, among others, statistical approaches to the problem are 
important. For example, measures of performance based on the $L^1$ norm 
are more appropriate than those founded on $L^p$ norms for $p > 1$. The 
paper will describe these viewpoints, and outline the way in which they 
may be combined to produce a cohesive theory of curve estimation from 
gridded data. We shall start with the relatively simple case of approxima- 
tion to a simple linear boundary, where data are observed without noise, 
and progress through an analysis of the number-theoretic connections, 
concluding with results in the context of stochastic or curved boundaries 
observed with noise. 


Key words: Curve estimation, edge, gradient, grid, integral metric, irra- 
tional number, nonparametric, rational number, slope, vertex. 


AMS subject classification: Primary 60G35; secondary 62G20. 




1 Defining a straight-line boundary 


Imagine placing a straight line across a square lattice in the plane, thereby 
dividing the plane into two parts. Assuming that the line is not vertical, 
colour black those lattice vertices above the line and white the vertices 
below, with a third colour (red, say) for any vertices that lie on the line. 
Now remove the line, and attempt to reconstruct it from the pattern of 
vertex colours. This is a theoretical idealisation of a range of practical 
boundary estimation problems, where a curve representing the boundary 
between two areas of different colour is to be estimated from pixel data. 

Even a brief consideration of this problem shows that its solution de- 
pends critically on the nature of the slope of the line. For example, if the 
slope is rational and if the line passes through some vertex, then the line 
necessarily passes through an infinite number of vertices. In this case, if 
we were able to observe the vertex colour pattern in a large enough region 
of the plane, we would see that there are at least two red vertices there, 
and from them we could trivially deduce the equation of the line. Then, 
we would know the line exactly. 

On the other hand, if the line has rational slope but does not pass 
through any vertex, it cannot be determined exactly even if we know the 
colour of every vertex in the plane. This is perhaps most easily seen if the 
line, $\mathcal{L}$ say, is parallel to one of the axes of the square lattice. In that case 
there exists an infinite strip in the plane, with its sides parallel to the line 
and its width equal to the edge width of the lattice, such that any straight 
line contained wholly within the strip produces exactly the same vertex 
colour pattern as $\mathcal{L}$. 

A similar situation arises for any line with rational gradient, where the 
intercept is chosen so that the line does not pass through any vertex. In 
such cases, while the gradient may be determined exactly from vertex colour 
data within the whole plane, the intercept will remain unknown beyond the 
fact that it lies within a certain nondegenerate interval — except when the 
line passes through a vertex. So, in the case of a line with rational slope we 
know either everything or nothing: either we can compute the line exactly 
from a finite amount of vertex colour data (when the line passes through 
a vertex) or we cannot compute it exactly even from an infinite amount of 
data (if it does not pass through any vertex). 
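
The dichotomy for rational slopes can be checked directly on a small window of the integer lattice. The following is a minimal sketch rather than anything taken from the paper: the function names, the window size and the particular slopes and intercepts are illustrative assumptions, and exact rational arithmetic is used so that "on the line" is decided without rounding error.

```python
from fractions import Fraction

def colour(i, j, slope, intercept):
    """Colour of lattice vertex (i, j) relative to the line y = slope*x + intercept."""
    height = slope * i + intercept
    if j > height:
        return "black"
    if j < height:
        return "white"
    return "red"

def colour_pattern(slope, intercept, half_width):
    """Colours of all vertices (i, j) with |i|, |j| <= half_width."""
    rng = range(-half_width, half_width + 1)
    return {(i, j): colour(i, j, slope, intercept) for i in rng for j in rng}

# Two lines parallel to the x-axis, both strictly inside the strip 0 < y < 1.
flat_a = colour_pattern(Fraction(0), Fraction(1, 3), 20)
flat_b = colour_pattern(Fraction(0), Fraction(2, 3), 20)

# Two lines of slope 1/2 whose intercepts avoid every vertex.
tilt_a = colour_pattern(Fraction(1, 2), Fraction(1, 4), 20)
tilt_b = colour_pattern(Fraction(1, 2), Fraction(1, 3), 20)

# A line of slope 1/2 through the origin passes through many vertices.
through = colour_pattern(Fraction(1, 2), Fraction(0), 20)
red_vertices = sorted(v for v, c in through.items() if c == "red")

print(flat_a == flat_b, tilt_a == tilt_b, len(red_vertices))
```

In each of the first two pairs the colour patterns coincide, so the intercept cannot be recovered from the colours alone, whereas any two red vertices of the third line determine it exactly.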

The situation is quite different if the line has irrational slope. There, if 
the colour pattern is observed within an increasingly large region R, say 
an $n \times n$ section of the lattice centred roughly on the line, then an 
approximation to the line may be constructed using only the colour pattern within 
$R$. As $R$ expands, the accuracy with which the line may be approximated 
increases. More explicitly, we may compute an approximation 
$\hat{\mathcal{L}} = \hat{\mathcal{L}}(P)$ to $\mathcal{L}$, using only the vertex colour 
pattern $P$ within $R$, such that the Hausdorff distance between 
$\mathcal{L} \cap R$ and $\hat{\mathcal{L}} \cap R$ converges to zero as $R$ increases. 

In the case of irrational slope the rate of convergence of a good 
approximation $\hat{\mathcal{L}}$ depends intimately on the nature of the irrational slope. 
It depends hardly at all on whether $\mathcal{L}$ intersects a lattice vertex; this 
influences only the constant multiple of the optimal rate of convergence of 
$\hat{\mathcal{L}}$ to $\mathcal{L}$, not the rate itself. Thus, the problem of approximating 
straight-line boundaries is starkly ill-posed, since nearby slopes can produce very 
different convergence rates along infinite subsequences. 
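
To see the approximation improve as the region grows, one can compute the set of slopes that are indistinguishable from the true one on the observed vertices. The sketch below is an illustration under simplifying assumptions that are not in the paper: the line passes through the origin with known intercept, its slope is $u = \sqrt{2} - 1$, and only the columns $i = 1, \ldots, n$ are observed. A line $y = sx$ then reproduces the colour pattern exactly when $\lfloor iu \rfloor < si < \lfloor iu \rfloor + 1$ for every such $i$. The width of this slope interval is used here as a simple proxy for the Hausdorff error.

```python
from fractions import Fraction
from math import isqrt, sqrt

def floor_multiple(i):
    """floor(i * u) for u = sqrt(2) - 1, computed exactly with integers."""
    return isqrt(2 * i * i) - i

def slope_interval(n):
    """Open interval of slopes s for which y = s*x colours the vertices
    (i, j), 1 <= i <= n, exactly as y = u*x does."""
    lo = max(Fraction(floor_multiple(i), i) for i in range(1, n + 1))
    hi = min(Fraction(floor_multiple(i) + 1, i) for i in range(1, n + 1))
    return lo, hi

for n in (10, 100, 1000):
    lo, hi = slope_interval(n)
    print(n, lo, hi, float(hi - lo))
```

The endpoints are rational numbers with denominators at most $n$, which is where the link to rational approximation of the slope enters.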

In Section 2 we shall treat examples of classes of irrational numbers, 
which capture the spirit of the boundary approximation problem and its 
solution. Section 3 will employ the examples to motivate development of 
more general boundary approximation problems, and will discuss ways in 
which the problems might be tackled. Section 4 will briefly survey the 
number-theoretic background to the methods. Later sections will develop 
theories for curved boundary estimation using local linear methods, bor- 
rowing ideas that are now well understood in more traditional statistical 
settings. For the latter, the reader is referred to Wand and Jones (1995, 
Chapter 5) and Fan and Gijbels (1996). 

In Sections 1-5 we shall always assume that the lattice is fixed; without 
loss of generality it has its vertices at integer pairs (i,j) in the Cartesian 
plane, so that its edge width (the width of the side of the smallest square 
face of the lattice) is 1. In later sections we shall sometimes consider lattices 
of increasing fineness, so as to model the physical problem of approximat- 
ing a curved boundary on a fine pixel grid. Technical details behind our 
arguments may be found in Hall and Raimondo (1996a,b). 

While we shall concentrate on the case of a square lattice, for definite- 
ness, the results that we shall describe are valid for any regular lattice that 
has the property that it contains a square lattice and is contained within 
the union of a finite number of square lattices. Thus, our results are avail- 
able for lattices whose faces are hexagons or triangles. Lattices of the latter 
type are used in practice in J.P. Serra’s image analyser. When considering 
an “n x n” portion of a general lattice we interpret n as the square root of 
the number of vertices within a finite, square subset of the lattice. 


2 Classes of irrational numbers 


The irrational numbers with which many of us are most familiar are the so- 
called “quadratic irrationals”, defined as the set of real numbers that may 
be expressed as solutions of quadratic equations with rational coefficients 
(or, without loss of generality, integer coefficients). These are a subset of the 
class of so-called periodic irrationals, and also of the larger class of badly 
approximable irrationals, which we shall discuss in Section 5. Straight-line 
boundaries with slope coming from one of these classes have special 
properties with respect to the boundary approximation problem. Indeed, 
in such cases the optimal rate of convergence (in the sense of the Hausdorff 
metric) of approximations based on vertex colours within an $n \times n$ subset 
of the lattice, is asymptotic to a constant multiple of $n^{-1}$. 

The set of algebraic irrationals is larger than the class of quadratic irra- 
tionals, and is defined as the set of all real numbers that may be expressed 
as solutions to polynomial equations with rational coefficients. However, 
the most accurate available estimate of the rate of convergence in the lin- 
ear boundary approximation problem for boundaries with slope equal to an 
algebraic irrational, is only the upper bound of $O(n^{-1+\epsilon})$ for all $\epsilon > 0$. Not 
even logarithmic refinements, such as $O(n^{-1} \log n)$, are available. The upper 
bound $O(n^{-1+\epsilon})$ is a corollary of deep number-theoretic work of Roth 
(1955), who determined the exact exponent in the Thue-Siegel inequality 
and for which work he was awarded the Fields medal in 1958. 

Roth’s result is virtually equivalent to the upper bound $O(n^{-1+\epsilon})$, for 
all $\epsilon > 0$, in our approximation problem. If we could improve on that 
rate then we could refine Roth’s Theorem, as it is known. And of course, 
even if we could refine Roth’s result, we would still have only scratched 
the surface as far as solving our problem goes, since the great majority of 
irrational numbers are not algebraic. Indeed, since the number of rationals 
is countable then the number of polynomial equations of degree p with ra- 
tional coefficients is countable. Therefore, the number of solutions of such 
equations, for arbitrary p, is countable. Hence, the number of algebraic 
irrationals is countable, whereas the number of irrational numbers is un- 
countably infinite. By focusing only on algebraic irrationals we would be 
missing the great majority of irrational numbers. 

It might be thought that because the algebraic irrationals are dense in 
the real line, they provide a good guide to the sort of behaviour that will be 
experienced when the slope of the line is a non-algebraic irrational. While 
this is true from some viewpoints, the argument has limitations. Indeed, 
irrationals that are not algebraic, and produce particularly pathological 
convergence rates in our line approximation problem, also comprise a dense 
subset of the real line. To elucidate this point we mention that if $\alpha_1, \alpha_2, \ldots$ 
and $\beta_1, \beta_2, \ldots$ are any two sequences of positive numbers converging to zero, 
then there exists a dense set of irrational numbers such that, whenever the 
slope of the linear boundary is in this set, the optimal rate of convergence in 
our approximation problem on an $n \times n$ grid is bounded above by $\alpha_n$ along 
one subsequence of values of $n$, and bounded below by $\beta_n$ along another 
subsequence. 

This is perhaps not a major issue if we confine attention to exactly linear 
boundaries — we may simply exclude such pathological irrational numbers 
from contention as possible gradients. However, in the problem of local lin- 
ear estimation of a curved boundary the range of values of the gradient is an 
interval, and so includes representatives from any set which is dense in the 
real line. This fact, and the properties of irrational numbers noted above, 
make it clear that one must take care when defining boundary-estimation 
problems, to avoid becoming side-tracked by relatively unimportant cases. 


3 Defining boundary-estimation problems 


We need to pose boundary-estimation problems in such a way that we can 
deduce relatively simple principles behind rate-of-convergence properties. 
For that, we need some way of averaging over all possible choices of irra- 
tional gradients, so that the central issues in the problem will not be lost 
through consideration of pathological special cases. There are at least 
two ways of doing this. 

First, we might allow the slope of the boundary to be a random vari- 
able, and devote our discussion to its “average” properties. This is feasible 
for either straight or curved boundaries. If the boundary is linear then 
we may apply a random rotation to it, and more generally we may regard 
the boundary as a realization of a random curve whose equation is repre- 
sented by y = G(x), where G is a random, smooth function. Alternatively, 
we may choose to treat the boundary as fixed and curved, but estimate 
it at a randomly chosen point. Under such models we do not need to be 
too prescriptive about the type of averaging, since the more radical of the 
pathological cases described in the previous section arise only for sets of 
irrational numbers having measure zero. Therefore, if the random bound- 
ary, or the random point at which we estimate a fixed, curved boundary, is 
distributed in the continuum, then, by confining attention to almost sure 
properties we avoid all but reasonably regular cases. We shall outline this 
approach in Section 6. 

Alternatively, in the case of a curved boundary we may average approxi- 
mations in some way, for example by considering them in an integral metric. 
It turns out that the $L^1$ metric is more appropriate for this purpose than 
an $L^p$ metric for $p > 1$, since it is relatively resistant to large deviations 
in the approximation error. (In view of the properties described in Section 2, 
it comes as no surprise to learn that the approximation error can 
change dramatically as we move from one point to another along a curved 
boundary.) 

While the integral metric approach is attractive, not least because it is 
well established in the context of nonparametric curve estimation, it does 
require care. For example, if the boundary is linear then we are still faced 
with the ill-posed problem of the effect of rational-versus-irrational slope. 
The remedy is to avoid linear boundaries altogether. Now, one way of 
characterising a nonlinear boundary is to insist that its second derivative 
never vanish. Therefore, in Section 7 we shall study the $L^1$ performance of 
local linear approximations to twice-differentiable boundaries that do not 
have any points of inflexion. 


4 Rational approximation by continued-fraction 
expansion 


In order to appreciate the methods and results for general boundary-estima- 
tion problems it is necessary to understand the main elements of the theory 
of rational approximation by continued fractions. We shall survey them 
here, referring the reader to Leveque (1956, Chapter 9) and Khintchine 
(1963) for more detailed discussion. Section 5 will make the connections to 
boundary estimation explicit. 

A non-integer real number $u$ may be uniquely expressed as a continued 
fraction, 

$$u = [a_0; a_1, a_2, \ldots] = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cdots}},$$ 

where $a_0$ is an integer and $a_1, a_2, \ldots$ are strictly positive integers, called the 
partial denominators of $u$. The continued fraction expansion terminates if 
and only if $u$ is rational. Up to the termination point (in the case of rational 
$u$), or for all $n$ (if $u$ is irrational), the convergents of $u$ are defined to be 
the numbers 

$$\frac{p_0}{q_0} = a_0, \quad \frac{p_1}{q_1} = a_0 + \frac{1}{a_1}, \quad \frac{p_2}{q_2} = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2}}, \quad \ldots,$$ 

where $p_n$ and $q_n$ are relatively prime integers. The $q_n$'s are strictly positive 
and form a strictly increasing sequence. By definition, $p_n/q_n$ converges to 
$u$ as $n \to \infty$. The sequence of odd-indexed convergents is decreasing, and 
the sequence of even-indexed convergents increases. 

If $u$ is irrational then the convergents provide a sequence of rational 
approximations to $u$, often referred to as “continued fraction approximations”. 
The approximations are optimal in the sense that 

$$\inf_{p,\; 1 \le q \le q_n} |u - (p/q)| = |u - (p_n/q_n)|. \tag{1}$$ 

They also satisfy 

$$\{q_n (q_n + q_{n+1})\}^{-1} < |u - (p_n/q_n)| < (q_n q_{n+1})^{-1} \tag{2}$$ 

and 

$$\text{if $p$ and $q$ are relatively prime, and } |u - (p/q)| < (2q^2)^{-1}, \text{ then $p/q$ is a convergent of $u$.} \tag{3}$$ 

The quality of approximations by continued fractions is determined mostly 
by properties of large elements of the sequence $\{a_n\}$, or equivalently by 
large values of $q_{n+1}/q_n$, since it may be shown that $q_{n+1}/q_n \asymp a_n$ (meaning 
that the ratio of the left- and right-hand sides is bounded away from zero 
and infinity). 
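
The definitions above translate directly into a short computation. The sketch below is illustrative only: the function names and the example value $u = 415/93$ are assumptions made here, and a rational $u$ is used so that the expansion terminates and all arithmetic is exact. The convergents are built with the standard recurrences $p_n = a_n p_{n-1} + p_{n-2}$ and $q_n = a_n q_{n-1} + q_{n-2}$.

```python
from fractions import Fraction

def partial_denominators(u, max_terms=50):
    """Partial denominators a_0, a_1, ... of the rational number u."""
    terms = []
    while len(terms) < max_terms:
        a = u.numerator // u.denominator   # floor of u
        terms.append(a)
        frac = u - a
        if frac == 0:                      # expansion terminates: u is rational
            break
        u = 1 / frac
    return terms

def convergents(terms):
    """Convergents (p_n, q_n) built from the standard two-term recurrences."""
    p_prev, q_prev = 1, 0                  # (p_{-1}, q_{-1})
    p, q = terms[0], 1                     # (p_0, q_0)
    out = [(p, q)]
    for a in terms[1:]:
        p, p_prev = a * p + p_prev, p
        q, q_prev = a * q + q_prev, q
        out.append((p, q))
    return out

u = Fraction(415, 93)
terms = partial_denominators(u)
pq = convergents(terms)
print(terms)
print(pq)

# Check the two-sided bound (2) away from the end of the finite expansion.
for n in range(len(pq) - 2):
    (p, q), (_, q_next) = pq[n], pq[n + 1]
    err = abs(u - Fraction(p, q))
    print(n, Fraction(1, q * (q + q_next)) < err < Fraction(1, q * q_next))
```

For a rational $u$ the upper bound in (2) becomes an equality at the second-to-last convergent, which is why the check stops one step earlier; for irrational $u$ both inequalities are strict at every $n$.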


5 Relationship between convergents and rates of 
approximation to linear boundaries 


The importance of continued fraction expansions to the problem of approx- 
imating linear boundaries, as defined in Section 1, is that the optimal rate 
of approximation (in the Hausdorff metric) to a line $\mathcal{L}$ with irrational 
gradient $u$, using vertex colour data observed on an $n \times n$ grid, is essentially 
equivalent to $n$ times the optimal rate at which we can approximate $u$ by a 
rational number $p/q$ with $q$ not exceeding $n$. In view of properties (1)-(3) 
the latter rate is of the order of 

$$\{q_{k(n)}(u)\, q_{k(n)+1}(u)\}^{-1},$$ 

where $k(n) = k(n, u)$ denotes the largest $k$ such that $q_k(u) \le n$. Call these 
results (R). A relationship between rational approximations and lines on 
square lattices is also expressed by Klein diagrams; see for example Klein 
(1907). 

To illustrate the importance of the connection between convergents and 
boundary approximation we consider a simple example. The real number 
$u$ is said to be badly approximable (or BA, for short) if $\sup_n a_n(u) < \infty$. 
The set of all BA numbers in the interval [0,1] has cardinality equal to 
that of the continuum (see e.g. Schmidt 1980, p. 23), but is of measure zero 
(e.g. Khintchine 1963, p. 69). All quadratic irrationals are BA, since for 
them the sequence $\{a_n\}$ is eventually periodic. However, not all algebraic 
irrationals are BA. In view of the asymptotic equivalence of the sequences 
$a_n$ and $q_{n+1}/q_n$, and the fact that $q_n$ is increasing, $u$ is BA if and only if 
$\{q_{k(n)}\, q_{k(n)+1}\}^{-1}$ is bounded between two constant multiples of $n^{-2}$. 

From this result and (R) we see that an irrational number $u$ is BA 
if and only if the optimal rate (in the Hausdorff metric) at which a line 
with gradient $u$ may be approximated from vertex colour data in an $n \times n$ 
section of the lattice, is $n^{-1}$. As a corollary, the optimal rate is $n^{-1}$ when 
the gradient of $\mathcal{L}$ is a quadratic irrational. 
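
The $n^{-2}$ bound for badly approximable slopes can be illustrated numerically. The sketch below takes, as an assumed example, the quadratic irrational $(\sqrt{5} - 1)/2 = [0; 1, 1, 1, \ldots]$, whose convergent denominators are the Fibonacci numbers. It reads $k(n)$ as the largest index with $q_k \le n$ and checks that $n^2 \{q_{k(n)} q_{k(n)+1}\}^{-1}$ stays between fixed constants. The function names are hypothetical.

```python
from bisect import bisect_right

def denominators(partials):
    """Denominators q_0, q_1, ... of the convergents of [a_0; a_1, a_2, ...]."""
    qs, q_prev, q = [], 0, 1               # q_{-1} = 0, q_0 = 1
    for a in partials[1:]:
        qs.append(q)
        q, q_prev = a * q + q_prev, q
    qs.append(q)
    return qs

def scaled_rate(qs, n):
    """n^2 / (q_k q_{k+1}), with k = k(n) the largest index such that q_k <= n."""
    k = bisect_right(qs, n) - 1
    return n * n / (qs[k] * qs[k + 1])

golden = [0] + [1] * 40                    # (sqrt(5) - 1)/2 = [0; 1, 1, 1, ...]
qs = denominators(golden)
ratios = [scaled_rate(qs, n) for n in range(1, 10001)]
print(min(ratios), max(ratios))
```

Because every partial denominator equals 1, we have $q_{k+1} \le 2 q_k$, so the ratio lies in $[1/2, 2)$ for every $n$. A slope with unbounded partial denominators would let the same ratio come arbitrarily close to zero along a subsequence.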


6 A stochastic number-theoretic view 


Khintchine (1963), describing and developing work dating from the 1930’s 
(see e.g. Khintchine, 1935; Lévy, 1937), gave a concise account of rates of 
rational approximation to irrational numbers when the latter are chosen 
randomly with respect to Lebesgue measure. In view of the equivalence 
between problems of rational approximation and boundary approximation 
noted in Section 5, we may apply Khintchine’s results to our line estimation 
problem. 

To pose that problem in a stochastic setting we assume that the linear 
boundary is placed into the plane according to a random mechanism. For 
our purposes the mechanism may be defined very generally; we need only 
ask that the distribution of slope, conditional on the line’s intercept with 
any given axis, be continuous. This reflects the fact that, when the line has 
irrational gradient — which it will enjoy with probability 1 if the gradient 
has a continuous distribution — it is immaterial from the viewpoint of rates 
of approximation whether the line passes through a vertex. 

It is known, for example from Theorem 30 of Khintchine (1963), that if 
$\psi(n) = n^{-1} L(n)$ for a positive, slowly varying function $L$ then, for almost all 
real numbers $u$ (with respect to Lebesgue measure), $\psi(n)\, q_{n+1}(u)/q_n(u) = O(1)$ 
if and only if 

$$\sum_n \psi(n) < \infty; \tag{4}$$ 

and from Lévy (1937, p. 320) that $n^{-1} \log q_n(u) \to \pi^2/(12 \log 2)$ as $n \to \infty$. 
Hence, for almost all $u$, $\psi\{\log q_n(u)\}\, q_{n+1}(u)/q_n(u) = O(1)$ if and only if 
(4) holds. 

This result, and the relationship between rational approximation and 
linear boundary approximation discussed in Section 5, may be used to 
show that if the boundary is stochastic in the sense defined in the previous 
section, then with probability one the optimal rate of approximation to the 
boundary, in the Hausdorff metric restricted to a region containing an $n \times n$ 
grid and using data from that grid, equals 

$$O\{n^{-1} (\log n)\, L(\log n)^{-1}\} \tag{5}$$ 

if and only if (4) holds. Similarly, it may be proved that with probability 
one (4) is equivalent to asking that the optimal rate of approximation be 
no better than 

$$O\{(n \log n)^{-1} L(\log n)\} \tag{6}$$ 

along any subsequence. 

The bound at (5) implies that if the line is placed into the plane at 
random according to the regime suggested above, then with probability 1 
it can be approximated at rate 

$$O\{n^{-1} (\log n) (\log\log n) (\log\log\log n)^{1+\epsilon}\}, \tag{7}$$ 

for all $\epsilon > 0$, from vertex colour data within an $n \times n$ region of the square 
lattice; and the bound at (6) shows that the optimal convergence rate is no 
better than 

$$O\{n^{-1} (\log n)^{-1} (\log\log n)^{-1} (\log\log\log n)^{-1-\epsilon}\},$$ 

along some subsequence. Moreover, these results are false if $\epsilon$ is replaced 
by 0. 
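
Lévy's limit for $n^{-1} \log q_n(u)$, on which the logarithmic factors in these rates depend, can be checked by simulation. The following is a rough sketch under stated assumptions: a uniformly distributed real number is imitated by a random rational with an 8000-bit denominator, which is ample precision for the first 1000 partial denominators. The seed, the sizes and the function name are arbitrary choices made here.

```python
import math
import random

def cf_denominators(num, den, n_terms):
    """Denominators q_0, ..., q_{n_terms} of the convergents of num/den,
    obtained by running Euclid's algorithm on the pair (num, den)."""
    qs, q_prev, q = [1], 0, 1
    num, den = den, num % den              # discard the integer part a_0
    for _ in range(n_terms):
        if den == 0:                       # expansion has terminated
            break
        a, r = divmod(num, den)            # next partial denominator
        q, q_prev = a * q + q_prev, q
        qs.append(q)
        num, den = den, r
    return qs

rng = random.Random(1997)
bits = 8000
numerator = rng.getrandbits(bits)          # u = numerator / 2**bits in [0, 1)
n = 1000
qs = cf_denominators(numerator, 2 ** bits, n)
levy = math.pi ** 2 / (12 * math.log(2))
print(math.log(qs[n]) / n, levy)
```

For a single draw the two printed numbers should agree only roughly, since the fluctuation of $n^{-1} \log q_n$ is of order $n^{-1/2}$.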


7 Approximations to curved boundaries 


The results derived in Section 6 may be readily extended to the case of 
local linear approximations to smooth curves on a square lattice. There 
it is convenient to introduce the concept of a grid of increasing fineness, 
so as to develop a theory for curve estimation using increasing amounts 
of information. Rather than assume that the lattice has fixed edge width 
we suppose it has edge width $n^{-1}$. For example, we might suppose that 
its vertices are at points $(n^{-1} i, n^{-1} j)$, where $i, j$ range over the set of all 
integers. 

Replace $\mathcal{L}$ by a smooth curve C, for example given by the equation 
$y = g(x)$. As before, colour black the vertices above C and white the 
vertices below, and consider constructing a local linear approximation to 
$g$ at $x$ by employing the colours of all vertices that lie within the strip 
$S = S(x)$ defined by $\{(t, y) : x - h < t < x + h \text{ and } -\infty < y < \infty\}$. Here, $h$ 
plays the role of bandwidth in more traditional curve estimation problems, 
and the asymptotics involve $h = h(n)$ converging to zero as $n \to \infty$, in 
such a manner that $nh \to \infty$. 

We may define the local linear approximant, $\hat g(x)$, at $x$ to be any 
straight-line segment that agrees with the vertex colour pattern within $S(x)$; or 
any segment that has least number of disagreements, if no segment agrees 
completely. There are two sources of error in this approximation. First, 
there is a degree of bias, or systematic error, due to the fact that the 
part of C that lies within S is not exactly a straight line. As in more 
familiar, second-order nonparametric curve estimation problems, the bias 
is $O(h^2)$ as $h \to 0$ if $g$ has two bounded derivatives. Secondly, there is 
approximation error arising from the fact that our only information about 
g is in the form of vertex colours. If the problem of estimating g(x) has a 
random component, for example if x is taken to be a random variable with 
an absolutely continuous distribution, then the results developed in Section 
6 for the case of a random line may be applied directly to the setting of 
approximating a random curve by a line segment within S. 

In particular, formula (7) may be used to bound the second type of 
approximation error, provided we replace $n$ by $nh$ and allow $nh$ to increase 
without limit. Then, assuming that $\log(nh)$ increases like $\log n$, which will 
certainly be the case for optimal choice of $h$, we see that the second type 
of error is bounded above by 

$$O\{(n^2 h)^{-1} (\log n)^{1+\epsilon}\}. \tag{8}$$ 

Optimising the overall convergence rate involves balancing systematic 
and non-systematic sources of error; that is, choosing $h$ so that the bias 
term, of order $h^2$, is of the same size as the quantity at (8). This means 
taking $h$ to be of size $\{n^{-2} (\log n)^{1+\epsilon}\}^{1/3}$, which gives a convergence rate 
of $O[\{n^{-2} (\log n)^{1+\epsilon}\}^{2/3}]$. The rate $n^{-4/3}$, multiplied by a positive power 
of $(\log n)^{-1}$, may be shown to be a minimax lower bound in this problem. 
In related work, Korostelev and Tsybakov (1993) have shown that the 
rate $n^{-4/3}$ is minimax optimal in the case of certain random grids. Thus, 
the local linear approximation $\hat g$ is within at most a logarithmic factor of 
achieving the optimal rate. 
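
The balancing step can be checked in a few lines. The sketch below assumes that the bound at (8) has the form $(n^2 h)^{-1}(\log n)^{1+\epsilon}$, with $\epsilon$ standing for whatever positive constant is inherited from formula (7); it then equates this with the squared-bandwidth bias term.

```latex
h^2 \asymp (n^2 h)^{-1} (\log n)^{1+\epsilon}
\;\Longleftrightarrow\;
h^3 \asymp n^{-2} (\log n)^{1+\epsilon}
\;\Longleftrightarrow\;
h \asymp \{ n^{-2} (\log n)^{1+\epsilon} \}^{1/3},
\qquad\text{so that}\qquad
h^2 \asymp \{ n^{-2} (\log n)^{1+\epsilon} \}^{2/3}
      = n^{-4/3} (\log n)^{2(1+\epsilon)/3}.
```

The resulting rate exceeds $n^{-4/3}$ only by the logarithmic factor $(\log n)^{2(1+\epsilon)/3}$.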


8 An $L^1$ view of boundary approximation


In the account of boundary approximation just above, we incorporated 
an element of randomness in order to remove the ill-posed nature of the 
problem. Without that randomness, the pointwise properties of rates of 
convergence defy simple description. Alternatively, we may address global 
rates of convergence in an $L^p$ metric. We know from the work in earlier
sections that, while there are many pathological cases where convergence
rates are arbitrarily poor (along subsequences), and while such cases arise at
points forming a dense set, they have measure zero. Hence, we are entitled
to expect that they will not loom excessively large in an $L^p$ measure of
performance. Since the case p = 1 puts least emphasis on very large errors,
it is potentially the most useful.

Let C have equation y = g(x), and suppose we observe the vertex colour
pattern at all vertices $(in^{-1}, jn^{-1})$ for integers i, j with $0 \le i \le n$ and
$-\infty < j < \infty$. (Thus, we adopt the “increasingly fine grid” model suggested
in Section 7.) Construct the local linear approximation proposed in
Section 7, so that for each x in a compact interval $\mathcal{I}$ (which we take without
loss of generality to be [0,1]) we have an approximation $\hat g(x)$ to g(x)
using vertex colour data within the strip S(x). Employing property (2) of
convergents we may prove that if g has two bounded derivatives then, for
absolute constants $A_1$, $A_2$ and $A_3$,


$$|\hat g(x) - g(x)| \le A_1 h \{q_{N(u)}(u)\, q_{N(u)+1}(u)\}^{-1} + B h^2, \tag{9}$$


for all $x \in \mathcal{I}$, $0 < h \le \tfrac12$ and $nh \ge A_2$, where $u = g'(x)$, $N = N(u)$ is the
largest integer such that $q_N(u) \le A_3 nh$, and $B = \sup_{\mathcal{I}} |g''|$. Here, $q_n(u)$
is the denominator of the n’th convergent, $p_n(u)/q_n(u)$; see Section 4 for a
definition.

The second term on the right-hand side of (9) derives from bias, or 
equivalently from the systematic error induced by approximating a nonlin- 
ear curve by a short but linear segment. The first term results from the 
limited information available about g, in the form of vertex colours. That 
term can be arbitrarily large, owing to the sort of pathology noted at the 
end of Section 2. However, provided g” is bounded away from zero the inte- 
gral average of the first term is generally reasonable in size. In fact, it may 
be shown that if $h = h(n) \to 0$ in such a manner that $(nh)^{-1}(\log n)^2 \to 0$,
then 


$$\int_{\mathcal{I}} |\hat g(x) - g(x)|\, dx = O\{(n^2 h)^{-1} (\log n)^2 + h^2\}, \tag{10}$$


uniformly in functions g for which, for some C > 1,

$$C^{-1} \le \inf_{x \in \mathcal{I}} |g''(x)| \le \sup_{x \in \mathcal{I}} |g''(x)| \le C. \tag{11}$$


The lower bound in (11) ensures that g is not too much like a straight line. 

Choosing h of size $(n^{-1} \log n)^{2/3}$ in (10) we obtain a rate of approximation,
in the $L^1$ metric, of $O\{(n^{-1} \log n)^{4/3}\}$ uniformly in functions satisfying
(11). Again, this convergence rate is close to the optimum of $n^{-4/3}$; see
Section 7.
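
To see why this choice of h balances the two terms in (10), assuming the bound there reads $(n^2 h)^{-1}(\log n)^2 + h^2$, substitute $h = (n^{-1} \log n)^{2/3}$ into each term:

```latex
h^2 = (n^{-1} \log n)^{4/3},
\qquad
(n^2 h)^{-1} (\log n)^2
  = n^{-2}\, n^{2/3} (\log n)^{-2/3} (\log n)^2
  = n^{-4/3} (\log n)^{4/3}
  = (n^{-1} \log n)^{4/3}.
```

Both terms are therefore of the same order, and any other choice of h makes one of them larger.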

In principle, similar results may be derived for rates of approximation 
in $L^p$ metrics, where p > 1. However, those rates are inferior to the $L^1$ rate
by a polynomial order of magnitude. The reason is that, for p > 1, the $L^p$
metric gives greater weight to larger values of the error $|\hat g(x) - g(x)|$.

To better appreciate the nature of this problem, observe from (9) that 
we have the bound 


$$|\hat g(x) - g(x)| \le A_1 A_3^{-2} h (nh)^{-2} \{q_{N(u)+1}(u)/q_{N(u)}(u)\} + B h^2,$$

which is potentially the key to deriving formulae such as (10). However,
the ratio $Q_n(u) = q_{N(u)+1}(u)/q_{N(u)}(u)$ is very unstable. Bear in mind that,
when finding the integral average of the right-hand side, we are in effect
taking U to be a random variable with the Uniform distribution on $\mathcal{I}$, and
(in the case of the $L^p$ metric) asking that $E\{Q_n^p(U)\}$ be bounded. Now, it
may be shown that the process $\{Q_n(U), n \ge 1\}$ is Markovian, and that
(while the process is itself not stationary) it has a stationary limit distribution.
Therefore, $Q_n(U) = O_p(1)$ as $n \to \infty$. However, the stationary
distribution does not have any finite moments, and in fact $E\{Q_n^p(U)\} = \infty$
for all n and all $p \ge 1$. The term $(\log n)^2$ on the right-hand side of (10) is
the result of taking a more subtle approach to this problem, necessary even
in the case p = 1.
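
The instability of the ratio is easy to see numerically. The following is a minimal sketch, assuming the standard continued-fraction definition of convergents (the definition of Section 4 is not reproduced here); the function names and the argument `bound`, which plays the role of $A_3 nh$, are illustrative and not part of the original development.

```python
from fractions import Fraction


def convergent_denominators(u, max_terms=50):
    """Denominators q_0, q_1, ... of the continued-fraction convergents of u."""
    q_prev, q = 0, 1                      # q_{-1} = 0, q_0 = 1
    x = Fraction(u)
    x -= x.numerator // x.denominator     # drop a_0; it does not affect the q's
    dens = [q]
    while x != 0 and len(dens) < max_terms:
        x = 1 / x
        a = x.numerator // x.denominator  # next partial quotient
        q_prev, q = q, a * q + q_prev     # q_k = a_k q_{k-1} + q_{k-2}
        dens.append(q)
        x -= a
    return dens


def ratio_at_scale(u, bound):
    """q_{N+1}/q_N, where N is the largest index with q_N <= bound."""
    dens = convergent_denominators(u)
    N = max(k for k, q in enumerate(dens) if q <= bound)
    if N + 1 >= len(dens):
        return None                       # expansion terminated before q_{N+1}
    return Fraction(dens[N + 1], dens[N])


if __name__ == "__main__":
    # A slope with small partial quotients versus one very close to 1/3.
    print(ratio_at_scale(Fraction(13, 21), 10))
    print(ratio_at_scale(Fraction(10003, 30000), 100))
```

A slope that lies very close to a rational with small denominator has a large partial quotient, and the ratio jumps accordingly, which is the heavy-tailed behaviour described above.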


9 Estimating boundaries observed with noise 


The noiseless model introduced in Section 7 may be written in the form 


$$Y(i/n, j/n) = I\{j/n \le g(i/n)\},$$


where I(-) is an indicator function, Y (i/n,j/n) denotes the colour of the 
vertex at (i/n,j/n) (white is represented by 1 and black by 0), and the 
equation y = g(x) represents the boundary C. In practice, due to a com- 
bination of systematic and stochastic errors, the colour of each vertex may 
be more appropriately represented by a number between $-\infty$ and $\infty$. In
particular, we may write 


$$Y(i/n, j/n) = f(i/n, j/n) + \epsilon_{ij},$$


where f(·,·) is a function with a fault-type discontinuity along the curve
y = g(x), and the independent and identically distributed stochastic errors
$\epsilon_{ij}$ have zero mean.

It will be assumed that f admits the representation 


$$f(x, y) = f_1(x, y) + f_2(x, y)\, I\{y \le g(x)\},$$


where $f_1$ and $f_2$ each have two uniformly bounded derivatives of all types,
and $f_2$ is bounded away from zero. We suppose that g and its first two
derivatives are bounded on $\mathcal{I}$. Several different analogues of the local linear
estimators in Section 7 are possible; examples include versions based on 
least squares and on wavelets. We consider here only the former. It amounts 
to first computing a preliminary approximation, $\tilde g$, and then refining it using
local linear smoothing within a window. We shall consider a particularly 
simple preliminary estimator, based on kernel methods, as follows. 




Suppose we wish to estimate g at $x \in \mathcal{I}$. Write $i_n$ for the integer
nearest to nx, let K be a nonnegative, compactly supported, continuously
differentiable function, let $h_1$ equal a constant multiple of $n^{-2/3}$, and put


$$T(j) = (n h_1^2)^{-1} \sum_k K'\{(j - k)/(n h_1)\}\, Y(i_n/n, k/n),$$


which is a statistical approximation to the first derivative of $f(i_n/n, \cdot)$ at
j/n. Let $\hat\jmath$ denote a value which produces a global maximum of |T| in the
range $C_1 n \le j \le C_2 n$. Our preliminary estimator of g(x) is $\tilde g(x) = \hat\jmath/n$.
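
The preliminary estimator can be sketched in a few lines. The code below is an illustration only: it assumes a biweight kernel K, the normalisation $(n h_1^2)^{-1}$ (which does not affect the location of the maximum), and a hypothetical callable `column(k)` returning the observed colour $Y(i_n/n, k/n)$; none of these choices is prescribed by the text.

```python
def biweight_derivative(t):
    """K'(t) for the biweight kernel K(t) = (15/16)(1 - t^2)^2 on [-1, 1]."""
    return -3.75 * t * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def preliminary_estimate(column, n, h1, j_min, j_max):
    """Return j_hat / n, where j_hat maximises |T(j)| over j_min <= j <= j_max.

    column(k) is assumed to return the observed value Y(i_n/n, k/n).
    """
    m = n * h1                      # bandwidth measured in grid steps
    half = int(m) + 1               # kernel support, in grid steps
    best_j, best_abs = j_min, -1.0
    for j in range(j_min, j_max + 1):
        t_j = sum(
            biweight_derivative((j - k) / m) * column(k)
            for k in range(j - half, j + half + 1)
        ) / (n * h1 * h1)
        if abs(t_j) > best_abs:
            best_j, best_abs = j, abs(t_j)
    return best_j / n


if __name__ == "__main__":
    n, g_at_x = 1000, 0.4037        # hypothetical boundary height at x
    noiseless = lambda k: 1.0 if k / n <= g_at_x else 0.0
    print(preliminary_estimate(noiseless, n, n ** (-2.0 / 3.0), 0, n))
```

Because K' is odd, T(j) is close to zero wherever the column is locally constant and is largest in absolute value where the window straddles the jump.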

Next we define an improved estimator. Let W be a square window of
side length h = h(n), with its centre at $(i_n/n, \hat\jmath/n)$ and, for the sake of
definiteness, its axes aligned with those of the grid. Temporarily make the
assumption that within W, f assumes a constant value on either side of a
line $\mathcal{L}$. We fit $\mathcal{L}$ by least squares in the class $\mathcal{M}(C, W)$ of all lines $\mathcal{L}$ that
divide W into two sets of vertices of which the larger has no more than
C times the number in the smaller (where C > 1 is arbitrary but fixed).
Specifically, let $\mathcal{I}_1$ [respectively, $\mathcal{I}_2$] denote the set of vertex coordinates
w = (i/n, j/n) in W that lie above [below] $\mathcal{L}$, let $\sum_i$ denote the sum
over all $w \in \mathcal{I}_i$, let $\bar Y_i$ be the corresponding mean of Y(w), and put


$$S(\mathcal{L}) = \sum_{i=1}^{2} \sum\nolimits_i \{Y(w) - \bar Y_i\}^2.$$


Write $\hat{\mathcal{L}}$ for a line that minimizes $S(\mathcal{L})$ among all straight lines in $\mathcal{M}(C, W)$
that do not pass through any vertices. (The minimum is of course not
uniquely attained, and any measurable approach to breaking ties is allowed.)
Write $\hat g(x)$ for the ordinate of the point on $\hat{\mathcal{L}}$ with abscissa x.
Provided the distribution of the errors $\epsilon_{ij}$ has sufficiently light tails, it
may be proved that $\hat g$ has properties similar to those ascribed to the local
linear approximation in the no-noise case in Sections 7 and 8. For example, let
us assume that the moment generating function of the error distribution
exists and is finite in a neighbourhood of the origin. If g'' is bounded and
X is a continuous random variable (stochastically independent of the errors
$\epsilon_{ij}$), then it may be shown that, for suitable choice of h, with probability
one $\hat g(X)$ converges to g(X) at rate $O[\{n^{-2}(\log n)^{1+\epsilon}\}^{2/3}]$ (compare Section
7); and if |g''| is bounded away from both zero and infinity then, again for
appropriate h, $\hat g$ converges to g in $L^1$ at rate $O\{(n^{-1} \log n)^{4/3}\}$.


lute error estimators are surveyed and a number of extensions to related
problems are suggested. A very elementary example is used to illustrate
the basic approach of “interior point” algorithms for solving linear pro-


1 Why square errors?


Gauss (1823), in what can only be admired as an epitome of “proof by 
intimidation”, defended his decision to minimize sums of squared errors in
the following terms: 


It is by no means self evident how much loss should be assigned to a given 
observation error. On the contrary, the matter depends in some part on 
our own judgment. Clearly we cannot set the loss equal to the error 
itself; for if positive errors were taken as losses, negative errors would 
have to represent gains. The size of the loss is better represented by a 
function that is naturally positive. Since the number of such functions 
is infinite, it would seem that we should choose the simplest function 
having this property. That function is unarguably the square, and the 
principle proposed above results from its adoption. 
Laplace has also considered the problem in a similar manner, but he
adopted the absolute value of the error as his measure of loss. Now if I 
am not mistaken, this convention is no less arbitrary than mine. Should 
an error of double size be considered as tolerable as a single error twice 
repeated, or worse? Is it better to assign only twice as much influence 
to a double error or more? The answers are not self-evident, and the 
problem cannot be resolved by mathematical proofs, but only by an 
arbitrary decision. Moreover, it cannot be denied that Laplace’s conven- 
tion violates continuity and hence resists analytic treatment, while the 
results that my contention leads to are distinguished by their wonderful 
simplicity and generality. 


Despite the best efforts of such distinguished advocates as Laplace (1789), 
Edgeworth (1888), and Kolmogorov (1931), methods of estimation based 
on minimizing sums of absolute errors have languished in the shade of the 
edifice that Gauss built on the foundation of least squares. Why? There 
seem to be at least two elementary reasons. First, the “wonderful simplicity 
and generality” of squared error has produced an elegant statistical theory 
of the behavior of least squares estimators which, particularly in its finite- 
sample form for Gaussian cases, can only inspire awe and envy on the part 
of advocates of the quantile-esque methods of absolute errors. Some solace 
may be found in the very critical attack on least-squares based methods by 
the robustness movement launched by John Tukey in the 1940’s. The sec- 
ond, and perhaps even more damaging, is the perception that absolute error 
estimators are “difficult to compute.” To appreciate that this perception 
was perfectly valid at the end of the 19th century, one need only read a little 
of Edgeworth’s (1888) own arcane description of his geometric “algorithm” 
to compute the bivariate least absolute error regression estimator. 

With the advent of George Dantzig’s simplex algorithm in the late 1940’s
this situation changed dramatically, and by the mid-50’s there were several
formulations of the $\ell_1$ estimator for regression as a linear program and explicit
simplex-based programs to compute it. The paper of Wagner (1959)
clarified the important role of the $\ell_1$ dual problem. These efforts culminated
in the algorithm of Barrodale and Roberts (1974) which still serves as the $\ell_1$
algorithm of choice for most statistical computing environments. Contrary
to a plethora of dire warnings throughout the literature about the difficulty
of $\ell_1$ computation, this algorithm actually delivers least absolute error
regression estimates faster than the corresponding least squares algorithms
in many packages, including Splus and Stata, for problems of moderate
size, up to a few hundred observations. However, for larger problems the
Barrodale and Roberts algorithm exhibits $O(n^2)$ growth in execution time,
and thus quickly lives up to its slothful reputation. Portnoy (1991) provides
a detailed probabilistic complexity analysis for the simplex version
of the parametric quantile regression problem which sheds some light on
the theoretical rationale for the observed $O_p(n^2)$ behavior of the simplex
approach. In Portnoy and Koenker (1997), we have shown that combining
recent developments on interior point methods for solving linear programs
with careful preprocessing can improve both the theoretical and practical
performance of $\ell_1$ regression computations to the point that they are competitive
with least squares over the entire range of contemporary problem
dimensions.

In this paper I will briefly review these recent developments in $\ell_1$ computation
and then sketch some ideas for extending these developments into
a broader range of related problems in statistics.


2 Means vs. medians 


The most elementary instance of our basic problem may be posed as the 
simple question: Which is easier to compute, the median or the mean?
Surprisingly, the question is deceptively difficult. At the most naive level, 
it would be immediately recognized that the median has an advantage for 
computation “by hand”, an attribute implicit in the “median-polish” algo- 
rithms suggested by Tukey for robust ANOVA. Somewhat less naively, with 
modern computers in mind we might contemplate computing the mean in 
O(n) elementary operations (n additions and one division), while the median
appears to require sorting n numbers, a task which requires O(n log n)
comparisons. Further reflection suggests, however, that the median may not 
actually require a full sorting of the observations; a cleverly chosen partial 
sorting may suffice. Considerable further reflection yields the celebrated al- 
gorithm of Floyd and Rivest (1975), which manages to compute the median 
in O(n) comparisons. At this point we require a more delicate comparison 
of the relative effort of additions and comparisons and the precise constants 
associated with the O(n) median algorithm. Since such delicacy seems in- 
herently machine dependent to some degree, we will not attempt to pursue 
it further here, but will simply note that it is not implausible that a sophisticated
algorithm for the median could, for n sufficiently large, outperform
the computation of the mean, thus restoring the superiority of the median.
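
The idea that a partial sort suffices can be illustrated with a randomized selection routine. The sketch below is plain quickselect, which needs O(n) comparisons on average; it is not the Floyd and Rivest algorithm itself, whose sampling scheme yields a smaller constant, and the function names are illustrative.

```python
import random


def select(values, k, rng=None):
    """Return the k-th smallest element (0-based) without fully sorting."""
    rng = rng or random.Random(0)
    items = list(values)
    while True:
        pivot = items[rng.randrange(len(items))]
        lower = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        if k < len(lower):
            items = lower                      # answer lies below the pivot
        elif k < len(lower) + len(equal):
            return pivot                       # the pivot is the answer
        else:
            k -= len(lower) + len(equal)       # answer lies above the pivot
            items = [v for v in items if v > pivot]


def median(values):
    """Median via selection; averages the two middle values when n is even."""
    n = len(values)
    if n % 2:
        return select(values, n // 2)
    return 0.5 * (select(values, n // 2 - 1) + select(values, n // 2))


if __name__ == "__main__":
    print(median([5, 1, 4, 2, 3]))
```

Each pass discards the part of the sample that cannot contain the required order statistic, so the expected total work is a geometric series in n rather than n log n.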


3 Simplex for median regression 


Portnoy and Koenker (1997) reconsider the problem of solving the $\ell_1$ regression
problem,


$$\min_{b \in \mathbb{R}^p} \sum_{i=1}^{n} |y_i - x_i'b| \tag{1}$$


which may be formulated as the linear program,

$$\min\{e'u + e'v \mid y = Xb + u - v,\ (u, v) \in \mathbb{R}_+^{2n}\}. \tag{2}$$

This problem has the dual formulation

$$\max\{y'd \mid X'd = 0,\ d \in [-1, 1]^n\}, \tag{3}$$

or equivalently, setting $a = \tfrac12 d + \tfrac12 e$,

$$\max\{y'a \mid X'a = \tfrac12 X'e,\ a \in [0, 1]^n\}. \tag{4}$$

The simplex approach to solving this problem may be briefly described 
as follows. A p-element subset of N = {1,2,...,n} will be denoted by h, 
and X(h),y(h) will denote the submatrix and subvector of X,y with the 
corresponding rows and elements identified by h. Recognizing that solutions 
to (1) may be characterized as planes which pass through precisely p = 
dim(b) observations, or as convex combinations of such “basic” solutions, 
we can begin with any such solution, which we may write as, 


$$b(h) = X(h)^{-1} y(h). \tag{5}$$


We may regard any such “basic” primal solution as an extreme point 
of the polyhedral, convex constraint set. A natural algorithmic strategy is 
then to move to the adjacent vertex of the constraint set in the direction 
of steepest descent. This transition involves two stages: the first chooses 
a descent direction by considering the removal of each of the current basic 
observations and computing the gradient in the resulting direction, then 
having selected the direction of steepest descent and thus an observation 
to be removed from the currently active “basic” set, we must find the max- 
imal step length in the chosen direction by searching over the remaining 
n — p available observations for a new element to introduce into the “basic” 
set. Each of these transitions involves an elementary “simplex pivot” ma- 
trix operation to update the current basis. The iteration continues in this 
manner until no descent direction is found, at which point the current b(h) can be
declared optimal. 
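
The characterization of solutions by p-element subsets h can be illustrated by brute force. The sketch below is not the simplex algorithm: it simply evaluates every basic solution b(h) for the bivariate model y = b0 + b1 x (so p = 2) and keeps the best, which is feasible only for very small n. The function names are illustrative.

```python
from itertools import combinations


def l1_objective(x, y, b):
    """Sum of absolute residuals for the line y = b[0] + b[1] * x."""
    return sum(abs(yi - (b[0] + b[1] * xi)) for xi, yi in zip(x, y))


def best_basic_solution(x, y):
    """Exhaustive search over all basic solutions b(h) with |h| = 2."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue                            # X(h) is singular
        b1 = (y[j] - y[i]) / (x[j] - x[i])      # line through observations i, j
        b0 = y[i] - b1 * x[i]
        value = l1_objective(x, y, (b0, b1))
        if best is None or value < best[0]:
            best = (value, (b0, b1), (i, j))
    return best


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0, 4.0]
    y = [0.0, 1.0, 2.0, 3.0, 40.0]              # last observation is an outlier
    print(best_basic_solution(x, y))
```

The search visits all n(n-1)/2 candidate vertices, whereas simplex moves between adjacent vertices and examines only a small fraction of them.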

Sheynin (1973) has noted that Gauss was already aware in 1809 that 
minimizing absolute errors, as suggested by Boscovich and Laplace, entailed 
this “zero residual” property. It is therefore tempting to speculate on why 
it required another 150 years to develop the “wonderfully simple” idea of
moving from vertex to vertex in the direction of steepest descent. One 
possible explanation for this involves the distinction made by Gill, Murray 
and Wright (1991) between iterative and direct algorithms. As they put it, 


...we consider as direct a computation procedure that produces one and 
only one estimate of the solution, without needing to perform a posteriori
tests to verify that the solution has been found...In contrast, an
iterative method generates a sequence of trial estimates of the solution, 
called iterates. An iterative method includes several elements: an ini- 
tial estimate of the solution; computable tests to verify whether or not 
an alleged solution is correct; and a procedure for generating the next 
iterate in the sequence if the current estimate fails the test. 


Thus, the iterative nature of the simplex algorithm makes it rather like a 
voyage of exploration of the 15th century, sailing into the Atlantic believing 
that the world was flat, not knowing when, or even if, the voyage would 
end. Gaussian elimination, on the other hand, made least squares like a trip
along a familiar road; at each step one knew exactly how much further effort
was necessary. With the emergence of computers in the 1940’s, the risk, or 
uncertainty, of the iterative approach was transferable to the machine, and 
the spirit of adventure blossomed as investigators put down their pencils 
and watched the tapes whir and the lights flicker. 

Like the advantage of the median over the mean for hand computations, 
the simplex algorithm for median regression performs extremely well rela- 
tive to least squares in problems of modest size. In Figure 1 we can compare 
performance of the Barrodale and Roberts (1974) algorithm for median re- 
gression with the standard least squares algorithm as implemented in the 
function lm(y ~ x) in Splus. For p = 4 and n < 2000, median regression 
à la Barrodale and Roberts is actually faster than the corresponding least
squares computation. As the dimension of the parameter increases, the
advantage of $\ell_1$ over $\ell_2$ is somewhat attenuated, but even with p = 16,
there is still an advantage up to sample sizes of about 400.

In larger problems simplex-based computations for median regression
pale in comparison with speeds achievable by least squares. In Figure 2 I
illustrate this comparison over problems in the range 20,000 < n < 120,000,
and we see that the time required for the modified simplex approach embodied
in the Barrodale and Roberts algorithm tends to grow quadratically
in n while the QR factorization approach of lm grows only linearly in n. By
sample size n = 120,000 median regression takes nearly one hour, whereas
the corresponding least squares fit can be carried out in 10-20 seconds.
Is this differential inherent in the linear programming
formulation of the $\ell_1$ problem, confirming Gauss’s claims, or is it simplex
that is at fault?


Figure 1: Timing comparison of $\ell_1$ and $\ell_2$ algorithms: Times are in seconds for
the median of five replications for iid Gaussian data. The parametric dimension
of the models is p + 1 with p indicated above each plot; p columns are generated
randomly and an intercept parameter is appended to the resulting design.
Timings were made at 8 design points in n: 200, 400, 800, 1200, 2000, 4000,
8000, 12000. The solid line represents the results for the simplex-based Barrodale
and Roberts algorithm, l1fit(x,y) in Splus, and the dotted line represents least
squares timings based on lm(y ~ x).


Ironically, one of the great research challenges of numerical analysis of 
recent decades has been, “Why is simplex so quick?” Examples, notably 
that of Klee and Minty (1972), have shown that in problems of dimension n,
simplex can take as many as $2^n$ simplex pivots, each requiring O(n) effort.
From this perspective $O_p(n^2)$ effort for randomly generated $\ell_1$ problems
appears to be quite brilliant. The recent paper of Shamir (1993) surveys 
the extensive literature exploring this gap between theoretical worst-case 
behavior and practical average-case performance. The discussion of simplex 
in Gill, Murray and Wright (1991) is especially good on this aspect. 


4 Newton to the max: An elementary example 


To illustrate the shortcomings of the simplex method, or indeed of any 
strategy for solving linear programs which relies on an iterative path along 
the exterior of the constraint set, consider the problem depicted in Figure 
3. We have a random polygon whose vertices lie on the unit circle and our 
objective is to find a point in the polygon that maximizes the sum of its 
coordinates, that is, the point furthest north-east in the figure.
Since any point in the polygon can be represented as a convex weighting 
of the extreme points, the problem may be formulated as 


$$\max\{e'u \mid X'd = u,\ e'd = 1,\ d \in \mathbb{R}_+^n\}, \tag{6}$$


where e denotes a (conformable) vector of ones, X is an n x 2 matrix 
with rows representing the n vertices of the polygon and d is the vector 
of convex weights to be determined. Eliminating u we may rewrite (6) 
somewhat more simply as 


$$\max\{s'd \mid e'd = 1,\ d \in \mathbb{R}_+^n\}, \tag{7}$$


where s = Xe. This is an extremely simple linear program which serves as 
a convenient geometric laboratory animal for studying various approaches 
to solving such problems. Simplex is particularly simple in this context, 
because the constraint set is literally a simplex. If we begin at a random
vertex, and move around the polygon until optimality is achieved, we pass 
through O(n) vertices in the process. Of course, a random initial vertex 
is rather naive, and one could do much better with an intelligent “Phase 
1” approach that found a good initial vertex. In effect we can think of the 
“interior point” approach we will now describe as a class of methods to 
accomplish this, rendering unnecessary further travel around the outside of 
the polygon. 
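
The exterior walk is simple to simulate. The sketch below generates a random polygon with vertices on the unit circle and moves from a starting vertex to the better of its two neighbours until neither improves the objective; the function names and the seed are illustrative choices, not part of the original example.

```python
import math
import random


def random_polygon(n, seed=0):
    """n vertices on the unit circle, listed in angular order."""
    rng = random.Random(seed)
    angles = sorted(rng.uniform(0.0, 2.0 * math.pi) for _ in range(n))
    return [(math.cos(t), math.sin(t)) for t in angles]


def walk_to_optimum(vertices, start):
    """Walk around the outside of the polygon, always to the better neighbour.

    Returns the index of the optimal vertex and the number of moves taken.
    """
    n = len(vertices)
    s = [vx + vy for vx, vy in vertices]        # objective s = Xe at each vertex
    current, pivots = start, 0
    while True:
        left, right = (current - 1) % n, (current + 1) % n
        nxt = left if s[left] > s[right] else right
        if s[nxt] <= s[current]:
            return current, pivots               # no adjacent vertex improves
        current, pivots = nxt, pivots + 1


if __name__ == "__main__":
    polygon = random_polygon(200, seed=1)
    print(walk_to_optimum(polygon, start=0))
```

Since the polygon is convex, a vertex that beats both neighbours is globally optimal, but the number of moves from a poorly chosen start grows in proportion to n.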

Although prior work in the Soviet literature offered theoretical support 
for the idea that linear programs could be solved in polynomial time, thus 
avoiding the pathological exponential growth of the Klee-Minty examples, 
the paper of Karmarkar (1984) constituted a watershed in the numerical
analysis of linear programming. It offered not only a cogent argument 
for the polynomiality of interior point methods of solving LP’s, but also 
provided for the first time direct evidence that interior point methods were 
demonstrably faster than simplex in specific, large, practical problems. 

To explore several variants of interior point methods we will use our 
simple polygonal problem. Further details about more general LP’s and 
applications to $\ell_1$ regression, and quantile regression more generally, may
be found in Portnoy and Koenker (1997). The basic approach we will
describe to interior point methods for linear programming is set out in the
important survey paper by Lustig, Marsten and Shanno (1994). A more
detailed exposition may be found in the new monograph of Wright (1996).


[Plot: three panels, each showing time in seconds (vertical axis) against n in thousands (horizontal axis, 20 to 120).]


Figure 2: Timing comparison of $\ell_1$ and $\ell_2$ algorithms: Times are in seconds for
the median of five replications for iid Gaussian data. The parametric dimension
of the models is p + 1 with p indicated above each plot; p columns are generated
randomly and an intercept parameter is appended to the resulting design. Timings
were made at 4 design points in n: 20,000, 40,000, 80,000, 120,000. The solid line
represents the results for the simplex-based Barrodale and Roberts algorithm,
l1fit(x,y) in Splus, and the dotted line represents least squares timings based
on lm(y ~ x).


It is an amusing irony, illustrating the spasmodic progress of science,
that the most fruitful practical formulation of the interior point revolution
of Karmarkar (1984) can be traced back to a series of Oslo working papers
by Ragnar Frisch in the early 1950’s. This work is summarized in Frisch
(1956), and was considerably elaborated and extended in the monograph
of Fiacco and McCormick (1968). The connection between Karmarkar’s
approach and the earlier literature was developed in Gill, Murray, Saunders,
Tomlin and Wright (1986). The basic idea of Frisch was to replace the linear
inequality constraints of the LP by what may be called a log barrier, or
potential, function. Thus, in our example, we may reformulate (7) as,


$$\max\{s'd + \mu \sum_{i=1}^{n} \log d_i \mid e'd = 1\} \tag{8}$$


where now the barrier term $\mu \sum \log d_i$ serves as a penalty which keeps us
away from the boundary of the positive orthant. By judicious choice of a
sequence $\mu \to 0$ we might hope to converge to a solution of the original
problem.

The salient virtue of the log barrier formulation is that, unlike the original
formulation, it yields a differentiable objective function which is consequently
attackable by Newton’s method. Restricting attention, for the
moment, to the primal log-barrier formulation (8) and defining,


$$B(d, \mu) = s'd + \mu \sum_{i=1}^{n} \log d_i \tag{9}$$


we have $\nabla B = s + \mu D^{-1}e$ and $\nabla^2 B = -\mu D^{-2}$ where D = diag(d). Thus,
at any initial feasible d, we have the associated Newton subproblem

$$\max_p \{(s + \mu D^{-1}e)'p - \tfrac{\mu}{2}\, p'D^{-2}p \mid e'p = 0\}.$$

This problem has first order conditions

$$\begin{aligned} s + \mu D^{-1}e - \mu D^{-2}p &= ae \\ e'p &= 0 \end{aligned}$$

and multiplying through by $e'D^2$, and using the constraint, we have,

$$e'D^2 s + \mu e'De = a\, e'D^2 e.$$

Thus solving for the Lagrange multiplier $\hat a$ we obtain the Newton direction

$$p = \mu^{-1}D^2 s + De - \mu^{-1}\hat a D^2 e \tag{10}$$

where $\hat a = (e'D^2e)^{-1}(e'D^2s + \mu e'De)$. Pursuing the iteration $d \leftarrow d + \lambda p$,
thus defined, with $\mu$ fixed until convergence, yields a point $d(\mu)$ on the central path,
which describes a yellow brick road to the solution $d^*$ of the original problem
(6). We must be careful to keep the step lengths $\lambda$ small enough to maintain
the interior feasibility of d. Note that the initial feasible point d = e/n
represents $d(\infty)$.
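
The primal iteration can be sketched as follows. The Newton direction is that of (10); the step-length rule (a full step when the scaled step is short, a damped step $1/(1+\delta)$ otherwise) is a standard safeguard for log-barrier functions that I have assumed here, since the text only requires that $\lambda$ keep d in the interior. Function names and tolerances are illustrative.

```python
import math


def newton_direction(s, d, mu):
    """Newton direction (10) for the barrier problem (8) at an interior d."""
    d2 = [di * di for di in d]
    # Lagrange multiplier: a = (e'D^2 e)^{-1} (e'D^2 s + mu e'D e)
    a_hat = (sum(q * si for q, si in zip(d2, s)) + mu * sum(d)) / sum(d2)
    # p = mu^{-1} D^2 s + D e - mu^{-1} a D^2 e
    return [q * (si - a_hat) / mu + di for q, si, di in zip(d2, s, d)]


def center(s, d, mu, tol=1e-9, max_iter=500):
    """Iterate d <- d + lam * p with mu fixed, approximating d(mu)."""
    for _ in range(max_iter):
        p = newton_direction(s, d, mu)
        # Length of the step in the scaling D^{-1}; below 1, a full step
        # cannot reach the boundary of the positive orthant.
        delta = math.sqrt(sum((pi / di) ** 2 for pi, di in zip(p, d)))
        if delta < tol:
            break
        lam = 1.0 if delta < 0.25 else 1.0 / (1.0 + delta)   # assumed rule
        d = [di + lam * pi for di, pi in zip(d, p)]
    return d


def follow_path(s, mus):
    """Trace the central path from d = e/n along a decreasing sequence of mu."""
    d = [1.0 / len(s)] * len(s)
    path = []
    for mu in mus:
        d = center(s, d, mu)
        path.append((mu, d))
    return path


if __name__ == "__main__":
    angles = [0.3, 1.2, 2.5, 3.9, 5.1]            # hypothetical polygon
    s = [math.cos(t) + math.sin(t) for t in angles]
    for mu, d in follow_path(s, [2.0 ** -k for k in range(11)]):
        print(mu, [round(di, 4) for di in d])
```

As $\mu$ decreases the weight vector d concentrates on the vertex with the largest value of s, while every iterate remains strictly inside the simplex.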

As emphasized by Gonzaga (1992) and others, this central path is a cru- 
cial construct for the interior point approach. Algorithms may be usefully 
evaluated on the basis of how well they are able to follow this path. Clearly, 
there is some tradeoff between staying close to the path and moving along 
the path, thus trying to reduce $\mu$, iteration by iteration. Improving upon
existing techniques for balancing these objectives is the subject of a vast 
outpouring of current research. Excellent introductions to the subject are 
provided in the survey paper of Margaret Wright (1992) and the recent 
monograph of Stephen Wright (1996). 

Thus far, we have considered only the primal version of our simple polyg- 
onal problem, but it is also advantageous to consider the primal and dual 
forms together. The dual of (7) is very simple: 


$$\min\{a \mid ea - z = s,\ z \ge 0\}. \tag{11}$$




The scalar, a, is the Lagrange multiplier on the equality constraint of the 
primal introduced above, while z is a vector of “residuals,” or slack variables 
in the terminology of linear programming. This formulation of the dual 
exposes the real triviality of the problem — we are simply looking for the 
maximal element of the vector s = Xe. This is a very special case of 
the linear programming formulation of finding any ordinary quantile. But 
the latter would require us to split z into its positive and negative parts, 
and would also introduce upper bounds on the variables, d, in the primal 
problem. 

Another way to express the central path, one that nicely illuminates the 
symmetric roles of the primal and dual formulations of the original problem, 
is to solve the equations, 


$$\begin{aligned} e'd &= 1 \\ ea - z &= s \\ Dz &= \mu e. \end{aligned} \tag{12}$$


That solving these equations is equivalent to solving (8) may be immedi- 
ately seen by writing the first order conditions for (8) as 


$$\begin{aligned} e'd &= 1 \\ ea - \mu D^{-1}e &= s, \end{aligned}$$


and then appending the definition $z = \mu D^{-1}e$. The equivalence then follows
from the negative definiteness of the Hessian $\nabla^2 B$. This formulation is also
useful in highlighting a crucial interpretation of the log-barrier penalty
parameter, $\mu$. For any feasible pair (z,d) we have


$$s'd = a - z'd,$$


so z'd is equal to the duality gap, the discrepancy between the primal and
dual objective functions at the point (z,d). At a solution, we have the
complementary slackness condition z'd = 0, thus implying a duality gap of
zero. Multiplying through by e' in the last equation of (12), we may take
$\mu = z'd/n$ as a direct measure of progress toward a solution.
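
Both identities follow in one line from the feasibility conditions, as the following short derivation shows; it uses only the dual constraint $ea - z = s$, the primal constraint $e'd = 1$, and the last equation of (12).

```latex
s'd = (ea - z)'d = a\, e'd - z'd = a - z'd,
\qquad
e'Dz = d'z = \mu\, e'e = n\mu .
```

The first identity says that the dual objective a exceeds the primal objective s'd by exactly z'd; the second says that on the central path this gap equals $n\mu$.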

Applying Newton’s method to these equations yields 


$$\begin{pmatrix} Z & 0 & D \\ e' & 0 & 0 \\ 0 & e & -I \end{pmatrix} \begin{pmatrix} p_d \\ p_a \\ p_z \end{pmatrix} = \begin{pmatrix} \mu e - Dz \\ 0 \\ 0 \end{pmatrix},$$


where Z = diag(z) and we have again presumed initial, feasible choices of d and z. Solving
for $p_a$ we have

$$p_a = (e'Z^{-1}De)^{-1} e'Z^{-1}(\mu e - Dz)$$

which yields the primal-dual Newton direction:


$$p_d = Z^{-1}(\mu e - Dz - De\, p_a) \tag{13}$$
$$p_z = e\, p_a. \tag{14}$$


It is of obvious interest to compare this primal-dual direction with the 
purely primal step derived above. In order to do so, however, we need to 
specify an adjustment mechanism for $\mu$.

To this end we will now describe an approach suggested by Mehrotra 
(1992) that has been widely implemented by developers of interior point 
algorithms, including the interior point algorithm for quantile regression 
described in Portnoy and Koenker (1997). Given an initial feasible triple 
(d,a,z), consider the affine-scaling Newton direction obtained by evaluating 
(13) at $\mu = 0$. Now compute the step lengths for the primal and dual
variables respectively using


$$\lambda_d = \arg\max\{\lambda \in [0, 1] \mid d + \lambda p_d \ge 0\}$$
$$\lambda_z = \arg\max\{\lambda \in [0, 1] \mid z + \lambda p_z \ge 0\}.$$


But rather than precipitously taking this step, Mehrotra suggests adapting 
the direction somewhat to account for both the “recentering effect” introduced
by the $\mu e$ term in (13) and also for the nonlinearity introduced by
the last of the first order conditions. 

Consider first the recentering effect. If we contemplate taking a full step 
in the affine scaling direction we would have, 


$$\hat\mu = (d + \lambda_d p_d)'(z + \lambda_z p_z)/n,$$

while at the current point we have,

$$\mu = d'z/n.$$


Now, if $\hat\mu$ is considerably smaller than $\mu$, it means that the affine scaling
direction has brought us considerably closer to the optimality condition of
complementary slackness: z'd = 0. This suggests that the affine scaling
direction is favorable, that we should reduce $\mu$, in effect downplaying the
contribution of the recentering term in the gradient. If, on the other hand,
$\hat\mu$ isn’t much different from $\mu$, it suggests that the affine-scaling direction is
unfavorable and that we should leave $\mu$ alone, taking a step which attempts
to bring us back closer to the central path. Repeated Newton steps with
$\mu$ fixed put us exactly on this path. These heuristics are embodied in
Mehrotra’s proposal to update $\mu$ by


$$\mu \leftarrow \mu(\hat\mu/\mu)^3.$$
To deal with the nonlinearity, Mehrotra (1992) proposed the following
“predictor-corrector” approach. A full affine scaling step would entail

$$(d + p_d)'(z + p_z) = d'z + d'p_z + p_d'z + p_d'p_z.$$


The linearization implicit in the Newton step ignores the last term, in effect
predicting that since it is of $O(\mu^2)$ it can be neglected. But since we have
already computed a preliminary direction, we might as well reintroduce this
term to correct for the nonlinearity as well as to accomplish the recentering.
Thus, we compute the modified direction by solving


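$$\begin{pmatrix} Z & 0 & D \\ e' & 0 & 0 \\ 0 & e & I \end{pmatrix}
\begin{pmatrix} \delta_d \\ \delta_a \\ \delta_z \end{pmatrix} =
\begin{pmatrix} \mu e - Dz - P_d p_z \\ 0 \\ 0 \end{pmatrix},$$

where $P_d = \operatorname{diag}(p_d)$. This modified Newton direction is then subjected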
to the same step-length computation and a step is finally taken. It is 
important in more realistic problem settings that the linear algebra required 
to compute the solution to the modified step has already been done for the 
affine scaling step. This usually entails a Cholesky factorization of a matrix 
which happens to be scalar here, so the modified step can be computed by 
simply backsolving the same system of linear equations already factored to 
compute the affine scaling step. 
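
The predictor-corrector cycle described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used for the figures: the problem is taken to be $\min\{c'd \mid e'd = 1,\ d \ge 0\}$ with $c = -Xe$ for a matrix $X$ of polygon vertices, with dual constraint $ea + z = c$; the two Newton systems are solved numerically as dense block systems rather than by the closed-form expressions; the final step is damped by a factor 0.99 to stay strictly interior. The function names (`max_step`, `solve_newton`, `northeast_vertex`) and the fixed pentagon are hypothetical.

```python
import numpy as np

def max_step(v, p, damp=1.0):
    """Largest step in [0, 1] keeping v + step * p >= 0 (optionally damped)."""
    neg = p < 0
    if not neg.any():
        return 1.0
    return min(1.0, damp * float(np.min(-v[neg] / p[neg])))

def solve_newton(d, z, r):
    """Solve [[Z, 0, D], [e', 0, 0], [0, e, I]] (p_d, p_a, p_z) = (r, 0, 0)."""
    n = d.size
    K = np.zeros((2 * n + 1, 2 * n + 1))
    K[:n, :n] = np.diag(z)           # Z block acting on p_d
    K[:n, n + 1:] = np.diag(d)       # D block acting on p_z
    K[n, :n] = 1.0                   # e' p_d = 0 keeps e'd = 1
    K[n + 1:, n] = 1.0               # e p_a + p_z = 0 keeps e a + z = c
    K[n + 1:, n + 1:] = np.eye(n)
    p = np.linalg.solve(K, np.concatenate([r, np.zeros(n + 1)]))
    return p[:n], p[n], p[n + 1:]

def northeast_vertex(X, tol=1e-9, max_iter=100):
    """Weights d on the vertices (rows of X) maximising the sum of coordinates."""
    n = X.shape[0]
    e = np.ones(n)
    c = -X.sum(axis=1)               # assumed cost vector: minimise c'd
    d = e / n                        # equal weighting of the extreme points
    a = c.min() - 1.0                # assumed dual start with z = c - a e > 0
    z = c - a * e
    for _ in range(max_iter):
        mu = d @ z / n
        if mu < tol:
            break
        # Predictor: affine-scaling direction (mu = 0 on the right-hand side).
        p_d, _, p_z = solve_newton(d, z, -d * z)
        lam_d, lam_z = max_step(d, p_d), max_step(z, p_z)
        mu_hat = (d + lam_d * p_d) @ (z + lam_z * p_z) / n
        mu_new = mu * (mu_hat / mu) ** 3
        # Corrector: recentre with mu_new and reintroduce the second-order term.
        q_d, q_a, q_z = solve_newton(d, z, mu_new * e - d * z - p_d * p_z)
        lam_d, lam_z = max_step(d, q_d, 0.99), max_step(z, q_z, 0.99)
        d, a, z = d + lam_d * q_d, a + lam_z * q_a, z + lam_z * q_z
    return d

angles = np.array([0.3, 1.4, 2.7, 4.0, 5.2])            # a fixed pentagon
X = np.column_stack([np.cos(angles), np.sin(angles)])
d = northeast_vertex(X)
print(np.round(d, 6), int(np.argmax(X.sum(axis=1))))
```

Both linear solves share the same coefficient matrix, which is the point made in the text: in a serious implementation one would factor it once and backsolve twice.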

In Figure 3 we provide an example intended to illustrate the advantage 
of the Mehrotra modified step. The solid line indicates the central path. 
Starting from the same initial point d = e/n, the dotted line represents the 
first affine scaling step. It is successful in the limited sense that it stays 
very close to the central path, but it only takes a short step toward our final 
destination. In contrast, the first modified step, indicated by the dashed 
line, takes us much further. By anticipating the curvature of the central 
path, it takes a step more than twice the length of the unmodified, affine- 
scaling step. On the second step the initial affine-scaling step is almost on 
target, but again somewhat short of the mark. The modified step is more 
accurately pointed at the desired vertex and is thus, again, able to take a 
longer step. 

It is difficult in a single example like this to convey a sense of the overall 
performance of these methods. After viewing a large number of realiza- 
tions of these examples myself, I come away convinced that the Mehrotra 
modified step consistently improves upon the affine scaling step, a finding 
that is completely consistent with the theory. 
Figure 3: A simple example of interior point methods for linear programming:
The figure illustrates a random pentagon of which we would like to find the most
northeast vertex. The central path beginning with an equal weighting of the 5
extreme points of the polygon is shown as the solid curved line. The dotted line
emanating from this center is the first affine scaling step. The dashed line is
the modified Newton direction computed according to the proposal of Mehrotra.
Subsequent iterations are unfortunately obscured by the scale of the figure.


In Portnoy and Koenker (1997), we noted that recent work on the probabilistic
analysis of the computational complexity of interior point methods suggests
that algorithms with $O_p(np^3 \log^2 n)$ operations are possible for quantile
regression with $n$ observations and $p$ parameters. While such performance is
considerably better, in large problems, than the observed $O_p(n^2 p)$
performance of simplex, it is still inferior to the $O(np^2)$ complexity of
least squares. In the next section I very briefly describe a preprocessing
strategy for quantile regression problems that has been successful
in further narrowing this computational gap. 
5 Preprocessing for quantile regression


The idea of preprocessing quantile regression problems described in Portnoy 
and Koenker (1997) actually preceded the implementation of the interior 
point methods discussed above. Preprocessing rests on an extremely sim- 
ple idea: if, by preliminary estimation, or some other form of statistical 
necromancy, we could determine the signs of a significant group of obser- 
vations, we could then combine observations with positive residuals into 
a single “globbed” observation, and similarly glob together the negative 
observations, so that the original problem, 


$$\min_{b} \sum_{i=1}^{n} \rho_\tau(y_i - x_i'b) \qquad (15)$$

with $\rho_\tau(u) = u(\tau - I(u < 0))$ would be equivalent to,

$$\min_{b} \sum_{i \in N \setminus (J_L \cup J_H)} \rho_\tau(y_i - x_i'b)
  + \rho_\tau(y_L - x_L'b) + \rho_\tau(y_H - x_H'b) \qquad (16)$$

where $N = \{1, 2, \ldots, n\}$, $x_K = \sum_{i \in J_K} x_i$ for $K \in \{H, L\}$,
and $y_L$ and $y_H$ can be chosen arbitrarily small and large respectively, to
ensure that the corresponding residuals on the globbed observations remain
negative and positive. In this process we have reduced the problem of $n$
original observations to $n - \#\{J_L \cup J_H\} + 2$ observations, so if the
cardinality of the J-sets is large we have gained substantially. Under
plausible sampling assumptions we can, based on a preliminary subsample of $m$
observations, make a prediction region for $\{x_i'\beta : i = 1, 2, \ldots, n\}$
of width $O(p/\sqrt{m})$, so assigning observations above this region to $J_H$
and observations below this region to $J_L$, we would have
$M = O_p(np/\sqrt{m})$ observations falling inside the region.
This is illustrated in Figure 4. 
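
The equivalence of the original and the globbed problem can be checked numerically: as long as the residual signs of the globbed observations are as predicted, the two objective functions differ only by a constant that does not depend on $b$, so they share the same minimizer. The following is a minimal sketch under simplifying assumptions: the data are simulated, the "preliminary estimate" `b0` is simply set to the true coefficients, the band is a fixed half-width of 1.5 rather than a proper confidence band, and the helper names `rho`, `objective` and `glob` are hypothetical.

```python
import numpy as np

def rho(u, tau):
    """Check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0).astype(float))

def objective(X, y, b, tau):
    """Quantile regression objective: sum of rho_tau over the residuals."""
    return float(rho(y - X @ b, tau).sum())

def glob(X, y, low, high, big=1e6):
    """Collapse observations flagged low/high into two pseudo-observations."""
    keep = ~(low | high)
    Xg = np.vstack([X[keep], X[low].sum(axis=0), X[high].sum(axis=0)])
    yg = np.concatenate([y[keep], [-big, big]])   # y_L very small, y_H very large
    return Xg, yg

rng = np.random.default_rng(0)
n, tau = 200, 0.5
x = rng.uniform(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.standard_normal(n)

b0 = np.array([1.0, 2.0])                 # stand-in for a preliminary fit
fit0 = X @ b0
low, high = y < fit0 - 1.5, y > fit0 + 1.5   # predicted negative / positive residuals
Xg, yg = glob(X, y, low, high)

# For any b whose fit stays inside the band, the gap between the two
# objectives is the same constant.
trial = [b0, b0 + np.array([0.1, -0.2]), b0 + np.array([-0.3, 0.4])]
gaps = [objective(X, y, b, tau) - objective(Xg, yg, b, tau) for b in trial]
print(Xg.shape[0], np.ptp(gaps))
```

In practice the verification step matters: after solving the globbed problem one has to confirm that the fitted residual signs of the collapsed observations agree with the prediction.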

Minimizing the computational effort required to compute the preliminary fit
based on $m$ observations plus the effort required for the solution of the
globbed problem (16) with $M$ observations, we obtain $m^* = O((np)^{2/3})$,
which under our conjectured performance of the underlying interior point
algorithm yields a complexity for the full problem of

$$C = O_p(n^{2/3} p^3 \log^2 n) + O(np^2), \qquad (17)$$

where the first term comes from the solution of the two quantile regression
problems and the second term arises from the computation of the confidence
band.
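
The order of the optimal subsample size can be recovered by a rough balancing argument. The sketch below ignores logarithmic factors and assumes, as a simplification, that for fixed $p$ the cost of an interior point solve is proportional to the number of observations, so that only the total number of observations processed, $m + M$ with $M \approx np/\sqrt{m}$, needs to be minimized.

```latex
\begin{align*}
  c(m) &= m + \frac{np}{\sqrt{m}}, \\
  c'(m) &= 1 - \tfrac{1}{2}\, np\, m^{-3/2} = 0
     \quad\Longrightarrow\quad
     m^* = \left(\frac{np}{2}\right)^{2/3} = O\big((np)^{2/3}\big), \\
  M^* &= \frac{np}{\sqrt{m^*}} = 2^{1/3} (np)^{2/3} .
\end{align*}
```

Under these assumptions the preliminary problem and the globbed problem are of the same order, $(np)^{2/3}$, which is what drives the first term of the complexity bound.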
Figure 4: A bivariate example of quantile regression preprocessing: The figure
illustrates a bivariate scatter plot of 500 observations with y conditionally
Student t with 10 degrees of freedom. The curved dotted lines describe a confidence band for
the response variable based on the median regression fit for a sub-sample of 126 
observations. After globbing there are only 107 observations, including the two 
globbed observations. All the points outside the band are collapsed into this pair 
of pseudo-observations. The fit to the globbed sample is indicated by the solid 
line; since it falls inside the band we are assured that the globs are correct and 
that this solution is identical to a fit of the entire original sample. 


Further details are provided in Portnoy and Koenker (1997) and I will 
comment only briefly here on the important fact that any implementation 
of this preprocessing approach must verify that the solution to the globbed 
problem actually vindicates the predicted signs based on the confidence 
region. Since the simultaneous confidence region can be chosen to assure 
this with arbitrarily high probability, the eventuality that we may need 
to repeat the cycle to remedy some inaccurately predicted signs introduces 
another multiplicative factor which does not affect the orders in probability 
in the complexity computation. 

The crucial consequence of the formal complexity theory and the exten- 
sive concomitant empirical testing of our implementation of the algorithm is 
that the computational effort required for quantile regression can be made 
comparable with the effort required for least squares over the full range of 
currently plausible problem dimensions. In the final empirical example of 
Portnoy and Koenker (1997), we compare timings for a typical large econometric
application of quantile regression with $n = 113{,}547$ and $p = 6$. With
the new algorithm, quantile regression estimates take about 10 seconds on 
a Sparc-Ultra, comparable to the least squares time of 8 seconds. Simplex 
solution of the same quantile regression problems requires approximately 
an hour on the same machine. 
6 Prospects


There are many open questions posed by the rapid development of com- 
putational methods for quantile regression. I would like to touch on three 
topics in this brief final section. The first is applications to inference and the 
general problem of parametric programming viewed in the light of interior 
point methods. The second is applications to nonlinear quantile regression. 
And the third concerns nonparametric applications of quantile regression. 

As I have tried to emphasize elsewhere, an important virtue of the simplex
approach to $\ell_1$-type computation is the direct transition to parametric
programming, or sensitivity analysis. Having obtained a solution at one
quantile we immediately compute an interval of optimality for this solution;
at the endpoints the solution alters. We may then make a simplex pivot which
takes us to an adjacent vertex of the constraint set; continuing this process
traces out the entire path of solutions to the problem (15) for
$\tau \in [0,1]$. Efficient computation of the quantile regression process is
crucial for the smooth L-statistics described in Koenker and Portnoy (1990),
and the corresponding dual process is central to the elegant theory of rank
statistics introduced by Gutenbrunner and Jurečková (1992). Very similar
computations are required to compute the penalized quantile regression
spline estimators introduced in Koenker, Ng and Portnoy (1995) where the
degree of smoothing (bandwidth) parameter $\lambda$ plays the role of $\tau$.

The homotopy methods of interior point algorithms also lend themselves
naturally to parametric programming. In large problems it may be sufficient
to compute solutions on some grid in $\tau$ or $\lambda$ and we may thus avoid
passing through all the intermediate vertices by tunnelling through the
interior of the constraint set, passing directly from one grid point to the
next. Algorithms to do this are conceptually straightforward, given the
existing research, see for example Monteiro and Mehrotra (1995), but they
require some careful engineering.

Non-linear quantile regression, that is, quantile regression estimation like
(15) with a nonlinear response function replacing the linear predictor
$x'\beta$, is increasingly common in applications. Here too, interior point methods
and the preprocessing approaches described above can play a useful role. 
Some ideas along this line have been already described in Koenker and Park 
(1996). There is, however, considerable scope for refinement. 

Finally, in nonparametric applications of quantile regression there is a
wide array of competing methods, all of which can profit from more efficient
computational methods for large data sets. This is particularly true of the
quantile smoothing spline approach of Koenker, Ng and Portnoy (1995),
which offers new challenges in terms of exploiting sparsity in the interior
point matrix computations. This is a topic which has received intense
scrutiny in the interior point literature, and there are a number of very
promising approaches already available.

We are, I believe, on the verge of overthrowing the long-standing
computational disability of $\ell_1$ methods. In the next century, we may hope that
the young statistician looking for improved robustness, or simply for a more 
complete view of her data, may say of quantile regression, echoing Molly 
Bloom, “...yes I said yes I will Yes.” 
Abstract: This article deals with a family of implicitly defined rank
statistics, which are designed to make inference on general linear hypotheses
in a large class of nonparametric extensions of the classical linear model.
The new rank statistics are defined via the solutions of a continuous family
of minimization problems. For simple designs, the procedure leads to
the classical rank statistics.
1 Introduction


For a given known c.d.f. $F_0$ with continuous positive density $f_0$ and finite
second moment, let us first consider the classical parametric linear model

$$M^{Par}(F_0):\ Y_i \sim F_0(t - \mu_i), \quad \mu_i = \beta' x_i, \qquad (1)$$

where $Y_i$, $1 \le i \le n$, are independent responses and the vectors $x_i$
represent design conditions and covariables (we assume that the first
component $x_{i1}$ of the $x_i$ is 1, corresponding to the intercept, and denote
by $X = (x_1, \ldots, x_n)'$ the design matrix). Usually, in such models one is
interested in linear hypotheses of the form

$$H_0^{Par}(F_0):\ C\beta = 0. \qquad (2)$$


It is well known that this model is not invariant w.r.t. nonlinear increasing
transformations of the response, that is, if $m(t)$ is a nonlinear increasing
function, then the transformed responses $m(Y_i)$ in general do not follow
a linear model of the form described above (in the following, we use the
terms invariant or ordinal invariant as short terms for invariant against
strictly increasing transformations). On the other hand, in practice it often
is obvious that the observed responses do not come from a linear model
(1), but there might be an unknown nonlinear monotone transformation
such that the transformed observations do.

The smallest invariant model containing $M^{Par}(F_0)$ is the semiparametric
transformation model

$$M^{SPar}(F_0):\ Y_i \sim F_0(a(t) - \mu_i), \quad \mu_i = \beta' x_i, \qquad (3)$$

where the unknown increasing function $a(t)$ introduces a nonparametric
component. $\mu = (\mu_1, \ldots, \mu_n)'$ is still identifiable up to an
intercept, i.e. a constant multiple of $1_n$. The null hypothesis (2) in the
context of model $M^{SPar}(F_0)$ will be denoted by $H_0^{SPar}(F_0)$.

Rank statistics for inference within this kind of model were studied 
for example in Pettitt (1982, 1983, 1987), Doksum (1987), Cuzick (1988), 
Tsukahara (1992), see also Bickel et al. (1993) and Chauduri et al. (1994). 
They are based on different approximations of Hoeffding's formula
(Hoeffding, 1951) for the partial or marginal likelihood of the ranks,

$$n!\, P_\beta(R(Y) = r) = E\left[ \prod_{i=1}^{n}
  \frac{f_0\big(F_0^{-1}(U_{n:r_i}) - \beta' x_i\big)}
       {f_0\big(F_0^{-1}(U_{n:r_i})\big)} \right] \qquad (4)$$

in which $U_{n:l}$ is the $l$-th order statistic in a sample of $n$ i.i.d.
variables $U_{n1}, \ldots, U_{nn}$ distributed uniformly on $(0,1)$, and
$r = (r_1, \ldots, r_n)'$.

Note that in (4) only $\beta$ is unknown, that is, reduction to the maximally
invariant vector of ranks leads to a parametric likelihood, which however in
general is numerically difficult. One prominent counter-example is the
proportional hazards model ($F_0$ the extreme value c.d.f.), in which (4) becomes
Cox's (1972) partial likelihood (for the case without censoring or ties)


$$n!\, P_\beta(R(Y) = r) = \prod_{i=1}^{n}
  \frac{\exp(\beta' x_{(i)})}{\sum_{j \ge i} \exp(\beta' x_{(j)})},$$

where $x_{(i)}$ corresponds to the $i$-th largest response.

Pettitt treated approximations to (4) for $F_0$ normal, whereas Doksum
and Cuzick used different approximations for the general case, which
however still are numerically quite difficult. For the $F_0 = \Phi$ (normal)
case, which is one of the cases with an explicit expression for the
expectation of the ranks, Pettitt also proposed an alternative to maximizing
a likelihood, namely to minimize

$$\sum_i [R_i - E_\beta(R_i)]^2, \quad \text{with} \quad
  E_\beta(R_i) = 1 + \sum_{j \ne i} \Phi[(x_i - x_j)'\beta/\sqrt{2}].$$
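
The least squares criterion on the ranks is easy to evaluate directly. The sketch below assumes the normal case with unit variance, so that $Y_i - Y_j$ has mean $(x_i - x_j)'\beta$ and variance 2; the function names and the toy design are hypothetical and serve only to illustrate the formula for the expected ranks.

```python
import math
import numpy as np

def Phi(v):
    """Standard normal c.d.f."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def expected_ranks(X, beta):
    """E_beta(R_i) = 1 + sum_{j != i} Phi((x_i - x_j)' beta / sqrt(2))."""
    mu = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    n = mu.size
    return np.array([
        1.0 + sum(Phi((mu[i] - mu[j]) / math.sqrt(2.0)) for j in range(n) if j != i)
        for i in range(n)
    ])

def rank_criterion(ranks, X, beta):
    """Sum of squared deviations of observed from expected ranks."""
    return float(np.sum((np.asarray(ranks, dtype=float) - expected_ranks(X, beta)) ** 2))

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # toy design, one regressor
ranks = np.array([1, 2, 3, 4])               # responses increase with x
print(rank_criterion(ranks, X, [0.0]), rank_criterion(ranks, X, [1.0]))
```

With $\beta = 0$ every expected rank equals $(n+1)/2$, and for any $\beta$ the expected ranks sum to $n(n+1)/2$, which gives two quick sanity checks on an implementation.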
2 Statistical model


In the present paper, we shall generalize (3) to a nonparametric model
$M^{NPar}(F_0)$, which may be regarded as the synthesis of (3) with a
nonparametric model proposed by Akritas and Arnold (1994) in the case of $X$
representing a factorial design. The extension compared to (3) consists in
replacing inside $F_0$ the functions $\mu_i(t) = a(t) - \beta' x_i$ with the more
general functions $\mu_i(t) = x_i'\beta(t)$, where $\beta(t)$ is any smooth
function of $t$ such that the $\mu_i(t)$ are strictly increasing:

$$M^{NPar}(F_0):\ Y_i \sim F_0(x_i'\beta(t)). \qquad (5)$$


The semiparametric transformation model (3) is contained as the special case
$\beta(t) = a(t)e_1 - \beta$; the Akritas-Arnold (1994) model is obtained from (5)
by choosing $F_0(t) = t$, the c.d.f. of the uniform distribution. However this
latter choice does not satisfy our requirement that the density fo should 
be positive on the whole real line. This causes the compatibility conditions 
(8)-(11) to fail for the Akritas-Arnold model. 

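Introducing the link function $h = F_0^{-1}$ and denoting by $F_i$ the c.d.f.
of $Y_i$, $h \circ F = (h \circ F_1, \ldots, h \circ F_n)'$, we may state (5)
also in the form that $h \circ F$ must be continuously differentiable and
satisfy for all $t \in \mathbb{R}$ a linear condition

$$h \circ F(t) \in L = X\mathbb{R}^p = \{X\beta \mid \beta \in \mathbb{R}^p\}.$$

A natural extension of the linear hypothesis (2) is given by

$$H_0^{NPar}(F_0):\ C\beta(t) = 0 \ \text{for all } t, \ \text{or equivalently} \qquad (6)$$
$$H_0^{NPar}(F_0):\ h \circ F(t) \in L_0 = \{X\beta \mid \beta \in \mathbb{R}^p,\ C\beta = 0\} \ \text{for all } t. \qquad (7)$$

These definitions imply the compatibility properties

$$M^{Par}(F_0) \subset M^{SPar}(F_0) \subset M^{NPar}(F_0), \qquad (8)$$
$$H_0^{SPar}(F_0) = M^{SPar}(F_0) \cap H_0^{NPar}(F_0), \qquad (9)$$
$$H_0^{Par}(F_0) = M^{Par}(F_0) \cap H_0^{SPar}(F_0) \qquad (10)$$
$$\phantom{H_0^{Par}(F_0)} = M^{Par}(F_0) \cap H_0^{NPar}(F_0). \qquad (11)$$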


The models presented above should not be confused with the semiparametric
shift model $M^{SShft} = \bigcup_{F_0 \in \mathcal{F}} M^{Par}(F_0)$,
$\mathcal{F}$ a set of real c.d.f., i.e.

$$M^{SShft}:\ Y_i \sim F(t - \mu_i), \quad F \in \mathcal{F} \ \text{unknown}. \qquad (12)$$


This model is not invariant in our sense (w.r.t. monotone transformations).
It is closely related to a large class of rank tests, namely the procedures
collected under the name “ranking after alignment” (overviews in Adichie
(1984) and Puri and Sen (1985); for further references see there).

For example, in the k-sample problem, $M^{NPar}(F_0)$ is larger than $M^{SShft}$
(requiring only groupwise identical distribution but not the assumption that
the between-group differences just shift the c.d.f.), whereas the hypothesis

$$H_0^{NPar}(F_0):\ \beta_2(t) = \ldots = \beta_k(t) = 0 \ \text{for all } t$$


is identical with the classical hypothesis of no group differences. 

Besides the property of being an invariant extension of the parametric
linear model, $M^{SPar}(F_0)$ and $M^{NPar}(F_0)$ have another interesting
feature: They are characterized by the property that any discretization of
the response $Y$ leads to the class of discrete ordinal regression models
commonly named “cumulative logit” models (but the link function need not be
the logit link, it is just $h$). $M^{SPar}(F_0)$ corresponds to the model with
the “parallel regression lines” assumption (which is tested e.g. by SAS
procedure LOGISTIC). This model was considered e.g. by McCullagh (1980) and
Anderson and Philips (1981). $M^{NPar}(F_0)$ corresponds to the model without
the parallel regression lines assumption, which was considered e.g. by
Williams and Grizzle (1972).


Example 1 Let us consider the 2 x 2 factorial design with a continuous 
covariate x. The nonparametric transformation model (5) in this case is 


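$$P(Y_{ijk} \le t) = F_0\big(a(t) + \gamma_{1i}(t) + \gamma_{2j}(t)
  + \gamma_{3ij}(t) + \gamma_4(t)\, x_{ijk}\big),$$
$a, \gamma_{1i}, \gamma_{2j}, \gamma_{3ij}, \gamma_4$ smooth functions of $t$, and
$$H_0: \gamma_{3ij} = 0 \quad \text{or} \quad H_0: \gamma_4 = 0.$$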


Within the semiparametric transformation model, only a may depend on 
t, within the parametric submodel a(t) must be a linear function of t. The 
Akritas-Arnold-model corresponds to Fo(t) = t on (0,1). 


3 Regression rank statistics 


We define the regression rank score $\hat\beta(Y_i)$ of response $Y_i$ by the
nonlinear regression equation

$$\sum_{j=1}^{n} F_0\big(x_j'\hat\beta(Y_i)\big)\, x_j
  = \sum_{j=1}^{n} c(Y_i - Y_j)\, x_j \qquad (13)$$

where $c(t) = 0.5\, I[t = 0] + I[t > 0]$.
Proposition 1 If $f_0$ is positive on $\mathbb{R}$ and has finite expectation,
the solution of (13) exists and is unique.

Proof: The solution characterizes the minimum of the strictly convex function
of $\beta$:

$$G_i(\beta) = \sum_{j=1}^{n} \left( \int_{h(0)}^{x_j'\beta} F_0(s)\, ds
  - c(Y_i - Y_j)\, x_j'\beta \right), \qquad (14)$$

which is bounded from below because $\int |t|\, dF_0(t) < \infty$. $\blacksquare$

The solution might be infinite if there exists $\mu^* \in L$ such that
$\operatorname{sign}(\mu^*_j) = 2c(Y_i - Y_j) - 1$ for all $j$, that is, the
$x_j$ corresponding to $Y_j < Y_i$ are separated from those corresponding to
$Y_j > Y_i$ by some hyperplane $\{x \in \mathbb{R}^p \mid x'\beta^* = 0\}$. We
may avoid infinite solutions by replacing $F_0$ with $[(n+1)F_0 - 0.5]/n$
in (13) and (14).

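Only in the special case of $h$ being the logit link, (13) defines the
maximum likelihood estimator. In the one sample case ($Y_i$ i.i.d., $x_i = 1$),
(13) reduces to

$$n F_0(\hat\beta(Y_i)) = \sum_{j=1}^{n} c(Y_i - Y_j) = R(Y_i), \quad \text{hence}$$
$$\hat\beta(Y_i) = h(R(Y_i)/n), \quad R(Y_i) = \text{rank of } Y_i.$$

In the k-sample case, $\hat\beta(Y_i)$ is the vector of rank scores of $Y_i$
within the $k$ samples,

$$\hat\beta(Y_i) = \left( h\Big[\frac{R_1(Y_i)}{n_1}\Big], \ldots,
  h\Big[\frac{R_k(Y_i)}{n_k}\Big] \right)'. \qquad (15)$$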
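
For a saturated k-sample design the rank scores are available in closed form, which makes a small numerical check possible. The sketch below assumes the logit link and a cell-means coding of the groups (one indicator per group), so that the estimating equation decouples by group; the function names and the six data points are hypothetical. With the function $c$ as defined above, the count $R_g$ of an untied observation within its own group is its rank minus 0.5.

```python
import numpy as np

def c(t):
    """c(t) = 0.5 * I[t = 0] + I[t > 0], applied elementwise."""
    t = np.asarray(t, dtype=float)
    return 0.5 * (t == 0) + 1.0 * (t > 0)

def logit(u):
    """Logit link, the inverse of the logistic c.d.f."""
    return np.log(u / (1.0 - u))

def k_sample_scores(y, groups, t):
    """Rank scores h[R_g(t) / n_g] of the point t within each group g."""
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    R = np.array([c(t - y[groups == g]).sum() for g in labels])
    n_g = np.array([(groups == g).sum() for g in labels])
    return logit(R / n_g), R, n_g

y = [1.2, 3.4, 2.2, 0.7, 2.9, 1.9]
groups = [0, 0, 0, 1, 1, 1]
scores, R, n_g = k_sample_scores(y, groups, t=2.2)   # t is an observation of group 0
print(scores, R, n_g)
```

A score becomes infinite whenever the evaluation point lies below or above every observation of some group, which is the separation phenomenon mentioned after Proposition 1.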

In the general case, it might appear at first glance that computation of
all $n$ regression rank scores could be a time consuming task. This is not
true, however, if they are computed sequentially in their natural order: If
$Y_{n:1} < \ldots < Y_{n:n}$ denotes the ordered sample, one should compute the
ordered regression rank scores $\hat\beta(Y_{n:i})$ sequentially:
$\hat\beta(Y_{n:i})$ is computed from data $c(Y_{n:i} - Y_j)$,
$\hat\beta(Y_{n:i+1})$ is computed from data $c(Y_{n:i+1} - Y_j)$, which differ
from the first set only by adding 0.5 to two components. Consequently, only
few iterations will be necessary to compute $\hat\beta(Y_{n:i+1})$ when using
$\hat\beta(Y_{n:i})$ as starting value. Hence the actual effort for computation
of all solutions is only the effort of computing one nonlinear regression plus
$O(n)$. The regression rank statistics we define below do not use the
solutions at extreme order statistics. It is convenient to start the iteration
process at the sample median and to proceed in both directions, until the
solutions at the $\varepsilon \times 100\%$ and the
$(1 - \varepsilon) \times 100\%$ sample quantiles are obtained.
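
The sequential scheme can be sketched as a warm-started Newton iteration. The example below is an illustration under assumptions, not the author's code: it uses the logistic c.d.f. for $F_0$, simulated data with one covariate, plain Newton steps without step-size control, and $\varepsilon = 0.2$; the names `solve_score_eq` and `path` are hypothetical.

```python
import numpy as np

def c(t):
    """c(t) = 0.5 * I[t = 0] + I[t > 0], applied elementwise."""
    return 0.5 * (t == 0) + 1.0 * (t > 0)

def expit(v):
    """Logistic c.d.f., playing the role of F_0."""
    return 1.0 / (1.0 + np.exp(-v))

def solve_score_eq(X, Z, b, tol=1e-10, max_iter=100):
    """Newton iterations for X'(F_0(Xb) - Z) = 0; returns (b, iterations used)."""
    for it in range(max_iter):
        F = expit(X @ b)
        g = X.T @ (F - Z)
        if np.max(np.abs(g)) < tol:
            return b, it
        D = X.T @ (X * (F * (1.0 - F))[:, None])   # X' diag(f_0(Xb)) X
        b = b - np.linalg.solve(D, g)
    return b, max_iter

rng = np.random.default_rng(1)
n = 60
x = np.linspace(-1.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 0.5 * x + rng.standard_normal(n)

order = np.argsort(y)
lo, hi, mid = int(0.2 * n), int(0.8 * n), n // 2
path = {}
b, _ = solve_score_eq(X, c(y[order[mid]] - y), np.zeros(2))   # cold start at the median
path[mid] = b
for i in range(mid + 1, hi):                                  # upward, warm started
    b, _ = solve_score_eq(X, c(y[order[i]] - y), b)
    path[i] = b
b = path[mid]
for i in range(mid - 1, lo - 1, -1):                          # downward, warm started
    b, _ = solve_score_eq(X, c(y[order[i]] - y), b)
    path[i] = b
print(len(path), path[mid])
```

Because consecutive right-hand sides differ in only two components, each warm-started solve typically needs far fewer iterations than the initial one; a production version would add safeguards against separation near the extremes.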
In general, it is useful for both motivation and investigation of (13) to
replace there $Y_i$, respectively $Y_{n:i}$, with a continuously varying
parameter $t$:

$$\sum_{j=1}^{n} F_0\big(x_j'\hat\beta(t)\big)\, x_j
  = \sum_{j=1}^{n} c(t - Y_j)\, x_j. \qquad (16)$$

The path $t \mapsto \hat\beta(t)$ jumps at the points
$t_i = Y_{n:i} = \hat H^{-1}(i/n)$, $u \mapsto \hat\beta \circ \hat H^{-1}(u)$
jumps at $u_i = i/n$, where in the usual notation, $\hat H$ is the empirical
c.d.f. of the total (pooled) sample and $H = n^{-1} \sum_i F_i$. A key role
will be played by the regression rank score process for the transformation
model,

$$u \mapsto n^{1/2} \big( \hat\beta \circ \hat H^{-1}(u)
  - \beta \circ H^{-1}(u) \big). \qquad (17)$$


Remark 1 As it was mentioned already, (13) is the ML-equation only if 
h is the logistic link function. A worthwhile alternative to the present ap- 
proach consists in replacing (13) by the ML-equation for any link function. 
This could improve efficiency, but at the cost of additional assumptions, like 
strong unimodality of fo, to guarantee uniqueness of the solution. Note, 
that our assumptions, requiring only positivity, continuity and finite first 
moment for fo, are very weak. 


Remark 2 In Gutenbrunner and Jurečková (1992), we defined regression
rank scores for the semiparametric shift model $M^{SShft}$ in a different way,
due to the different nature of the statistical model and the associated invari- 
ance requirements (invariance w.r.t. p-dimensional affine transformations 
versus invariance w.r.t. componentwise monotone transformations). The 
common name is used because of the common purpose, namely to define 
rank statistics for the general linear model. 


We shall now define generalized rank statistics which are appropriate
for testing linear hypotheses $H_0^{NPar}(F_0)$ within model $M^{NPar}(F_0)$.
We call these statistics, which are linear combinations of ordered regression
rank scores, regression rank statistics for short.

More specifically, we focus here on weighted averages

$$\hat T = \sum_{i=1}^{n} w_{ni}\, \hat\beta(Y_{n:i})$$

with weights satisfying $\sum_{i=1}^{n} w_{ni} = 1$, $w_{ni} \ge 0$, $w_{ni} = 0$
if not $\varepsilon < i/n < 1 - \varepsilon$ for some $\varepsilon > 0$.
We assume here that the weights are generated by a score function $J$ via

$$w_{ni} = J\Big(\frac{i}{n}\Big) - J\Big(\frac{i-1}{n}\Big),$$

and that $J$ is the c.d.f. of a probability measure which is concentrated on
a compact subset of the open unit interval.
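
As a small illustration of how a score function generates the weights, the sketch below takes $J$ to be the c.d.f. of the uniform distribution on $[\varepsilon, 1 - \varepsilon]$ with $\varepsilon = 1/3$ (the 33% trimming used later in Example 4) and assumes the weights are the increments of $J$ over the grid $i/n$. The function names are hypothetical.

```python
import numpy as np

def J_trim(u, eps=1.0 / 3.0):
    """c.d.f. of the uniform distribution on [eps, 1 - eps]."""
    u = np.asarray(u, dtype=float)
    return np.clip((u - eps) / (1.0 - 2.0 * eps), 0.0, 1.0)

def weights(n, J):
    """Weights w_ni = J(i/n) - J((i-1)/n), i = 1, ..., n."""
    i = np.arange(1, n + 1)
    return J(i / n) - J((i - 1) / n)

w = weights(12, J_trim)
print(np.round(w, 3))
```

With $n = 12$ only the four central order statistics receive weight, so the resulting statistic is a trimmed mean of the ordered regression rank scores: given an array `B` whose rows are the scores at the ordered observations, it would be computed as `w @ B`.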

To stress the dependence of $\hat T$ on $J$ and its implicit dependence on the
link function $h$, we shall also write sometimes $\hat T = \hat T(h, J)$.
Writing $\hat H$ for the empirical c.d.f. of the total sample,
$Y_{n:i} = \hat H^{-1}(i/n)$ and $H = n^{-1} \sum_i F_i$, we have the
representation

$$\hat T(h, J) = \sum_{i=1}^{n} w_{ni}\, \hat\beta(Y_{n:i})
  = \int \hat\beta \circ \hat H^{-1}(u)\, dJ(u)
  = \int \hat\beta(t)\, dJ \circ \hat H(t) \qquad (18)$$

corresponding to the functional

$$T(h, J) = \int \beta \circ H^{-1}(u)\, dJ(u)
  = \int \beta(t)\, dJ \circ H(t). \qquad (19)$$

Within $M^{SPar}(F_0)$, (19) reduces to

$$T(h, J) = \int a(t)\, dJ \circ H(t)\, e_1 - \beta.$$

From (19) it is clear that the null hypothesis (6)/(7) implies $CT(h, J) = 0$.
The two “score functions” h and J determine the rank statistic T(h, J) 
in an asymmetric way: While $h$ is crucial for the statistical model and must
be specified correctly in order to obtain consistent estimators and tests, J 
merely determines efficiency properties like the score function of ordinary 
linear rank statistics. 

The basic idea, to replace one complicated estimating equation with a
continuous family of simple equations and to take a weighted average of
the family of solutions instead of the one solution of the complicated
equation, is not new: Koenker and Bassett (1978) extended sample quantiles to
the linear shift model $M^{SShft}$ (12) using an analogous approach, and
Koenker and Portnoy (1987) and Gutenbrunner and Jurečková (1992) considered
linear combinations of solutions.


Example 2 Taking the design from Example 1, but without the continuous
covariate, the regression ranks take the explicit form (15), since the model
is saturated. The components of the rank statistic of type $\hat T$ for testing
$H_0: \gamma_{3ij} = 0$ hence can be expressed directly as

$$\hat T_{ij}(h, J) = \sum_{l=1}^{n} w_{nl}\, h[R_{ij}(Y_{n:l})/n_{ij}],$$

where $R_{ij}(Y_{n:l})$ is the rank of the $l$-th largest pooled observation
w.r.t. the $ij$-th group.
For comparison with other rank statistics it is convenient to sum by
parts and express the statistic as a function of the ranks
$R_{ijk} = n\hat H(Y_{ijk})$ within the combined sample. This leads to


Ti; (h, J) = - [(JoH-JoFy)dho, 


_ _ hy [7 ( il Rix t+05\ /_k 
B > vijk moed nee E I? 
k=1 J “J 


where $R_{ij:k}$ is the $k$-th largest rank in group $ij$ and

$$v_{ijk} = h\Big(\frac{k + 0.5}{n_{ij} + 1}\Big) - h\Big(\frac{k - 0.5}{n_{ij} + 1}\Big),$$

using appropriate versions of empirical c.d.f. and ranks.


Example 3 (median scores): The simplest score function in our context
is the median score function $J(u) = I[u > 0.5]$. It is particularly
interesting because our results in this case state that the proper
generalization of the median test to tests for general linear hypotheses
consists in a very pragmatic procedure: dichotomize the continuous response at
the pooled median and compute a categorical regression (like logistic or
probit regression). Our results in Section 4 on the asymptotics imply that not
knowing the true median (i.e. using the data dependent dichotomization point)
introduces an additional random vector to the estimator of $\beta(0.5)$, which
is contained in the linear space generated by the null hypothesis and hence
does not affect test statistics based on contrasts orthogonal to that space.
In the semiparametric transformation model $M^{SPar}(F_0)$ the additional
random component asymptotically is proportional to the intercept; the slope
components of $\beta$ are not affected. In the nonparametric extension
$M^{NPar}(F_0)$, it is proportional to the derivative $\dot\beta(t)$ of
$t \mapsto \beta(t)$ at $t = H^{-1}(0.5)$, as follows from (25)-(28).


Example 4 (Steam data from Draper and Smith, 1981): Doksum 
(1987) used these data to demonstrate his Monte Carlo approximation for 
the MPLE (approximately maximizing Hoeffding’s partial likelihood (4)). 
The data are an example of a simple linear regression with a good fit of 
the parametric linear model, hence a case where the ordinary least squares 
estimator (LSE) is appropriate. 


The response $Y$ is pounds of steam per month needed by a power plant,
the regressor $x$ average atmospheric temperature in degrees Fahrenheit. In
the usual notation

$$Y_i = \alpha + \beta x_i + \sigma u_i,$$

one has to be aware that our $\beta = (\beta_1, \beta_2)'$ from the
transformation model is $(\alpha/\sigma, \beta/\sigma)'$, because
$F_0((t - \alpha - x\beta)/\sigma) = F_0(t/\sigma - \alpha/\sigma - x\beta/\sigma)
= F_0(a(t) - \alpha/\sigma - x\beta/\sigma)$.

We compare examples of our regression rank statistics with the LSE
and Doksum's likelihood sampling estimator. The second, third and fifth
regression rank estimators correspond to the score function
$J(u) = (-\tfrac{1}{2}) \vee 3(u - \tfrac{1}{2}) \wedge \tfrac{1}{2}$, defining a
trimmed mean type statistic. There is a remarkable gap between the LSE and our
estimators on one hand and Doksum's estimator on the other hand. Also,
replacing estimating equation (13) by the ML-equation for probit regression
takes the estimator nearer to the LSE.


The estimates in descending order: 

-.092 LSE (95% confidence limits: [-.136, -.048])

-.090 trimmed mean of regression rank scores, probit link function, using 
ML-equation for dichotomized data instead of (13) (trimming: 33% 
from both sides). 

-.084 trimmed mean of regression rank scores, logit link function (trimming: 
33% from both sides). 

-.082 regression rank score median, probit link function, using (13). 

-.081 trimmed mean of regression rank scores, probit link function, using 
(13) (trimming: 33% from both sides) 

-.064 Doksum’s likelihood sampling estimator 


4 Asymptotic representations 


In this section we show that the asymptotic representation of regression 
rank statistics T(h, J) (18) shares important properties of the correspond- 
ing representations for linear rank statistics as it was recently developed by 
Akritas and Arnold (1994), Akritas and Brunner (1996) and others. 
According to the asymptotic behaviour of the design we assume 


Xll = o(n!/?)  asn—o, (20) 

|(n XX) l2 = O0(1) asn —> oo, (21) 

Ki “1 rlx: > K]—=—0 asK->o. (22) 
i=]1 


For $F_i(t) = F_0(x_i'\beta(t))$ we assume a continuous density $f_i(t)$. This
assumption is equivalent to continuous differentiability of $\beta(t)$ w.r.t.
$t$. We denote the derivative w.r.t. $t$ as $\dot\beta(t)$.

Much in the spirit of Pyke and Shorack (1968), we shall expand the
regression rank process (17) into the difference of two empirical processes. 
Writing $Z(t)$ for the vector with components $c(t - Y_i)$, we may write (16)
as

$$g_t(\hat\beta(t)) = 0, \quad \text{where} \quad
  g_t(\beta) = X'[F_0 \circ X\beta - Z(t)], \quad \beta \in \mathbb{R}^p.$$

The derivative $D_t = (\partial/\partial\beta)\, g_t(\beta(t))$ is

$$D_t = X' \operatorname{diag}[f_0 \circ X\beta(t)]\, X. \qquad (23)$$


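The following steps are standard calculations: Denoting by $o^*(\cdot)$ and
$o_p^*(\cdot)$ convergence uniformly in $t$ for $t$ in compact sets,

$$0 = g_t \circ \hat\beta(t) = g_t \circ \beta(t) + D_t[\hat\beta(t) - \beta(t)]
  + o^*(\|\hat\beta(t) - \beta(t)\|)$$
$$\phantom{0} = X'(F(t) - Z(t)) + D_t[\hat\beta(t) - \beta(t)]
  + o^*(\|\hat\beta(t) - \beta(t)\|),$$

hence

$$n^{1/2}(\hat\beta(t) - \beta(t)) = W_n(t) + o_p^*(1),$$

with the empirical process

$$W_n(t) = n^{1/2} D_t^{-1} X'(Z(t) - F(t)). \qquad (24)$$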


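We used here that $f_0(x_i'\beta(t))$ is bounded away from zero for $t$ in
compact sets. The next routine step is

$$\sqrt{n}\,(\hat\beta \circ \hat H^{-1} - \beta \circ H^{-1})$$
$$= W_n \circ \hat H^{-1} + \sqrt{n}\,(\beta \circ \hat H^{-1} - \beta \circ H^{-1}) + o_p^*(1)$$
$$= W_n \circ H^{-1} + \sqrt{n}\, \dot\beta \circ H^{-1}\, (\hat H^{-1} - H^{-1}) + o_p^*(1)$$
$$= W_n \circ H^{-1} - \sqrt{n}\, \frac{\hat H \circ H^{-1} - \mathrm{id}}{\dot H \circ H^{-1}}\,
  \dot\beta \circ H^{-1} + o_p^*(1)$$
$$= W_n \circ H^{-1} - V_n \circ H^{-1} + o_p^*(1) \qquad (25)$$

with the second empirical process

$$V_n(t) = \sqrt{n}\, \frac{\hat H(t) - H(t)}{\dot H(t)}\, \dot\beta(t). \qquad (26)$$

In (25), the asymptotic replacement of $\sqrt{n}\,(\hat H^{-1} - H^{-1})$ by
$-\sqrt{n}\,(\hat H \circ H^{-1} - \mathrm{id})/(\dot H \circ H^{-1})$
follows the argument given e.g. in Serfling (1980), sect. 2.8.3., p. 112.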

The vector $\dot H(t)^{-1} \dot\beta(t)$ may also be written as
$(X'X)^{-1} X' d(t)$, $d_i(t) = (f_i(t)/f_\cdot(t))\, h'(F_i(t))$. If
$H_0^{NPar}(F_0)$ is true, the derivative $C\dot\beta(t)$ of $0 = C\beta(t)$ is
zero, therefore $V_n(t)$ has the important property

$$CV_n(t) = 0 \quad \text{under } H_0^{NPar}(F_0).$$
Hence, defining

$$M_i(t) = \int (I[t \le s] - F_i(s))\,\big(D_s^{-1} x_i - \dot H(s)^{-1} \dot\beta(s)\big)\, dJ \circ H(s) \quad \text{and}$$
$$M_i^0(t) = \int (I[t \le s] - F_i(s))\,\big(D_s^{-1} x_i\big)\, dJ \circ H(s),$$

we have sketched the proof of the asymptotic representation stated in the
following

Theorem 1 Within model $M^{NPar}(F_0)$, if the density $f_0$ is continuous,
positive on $\mathbb{R}$ and has finite expectation, the score function $J$ is
trimming and of bounded variation and conditions (20)-(22) are satisfied, then

$$\hat T(h, J) = T(h, J) + n^{-1} \sum_{i=1}^{n} M_i(Y_i) + o_p(n^{-1/2}), \qquad (27)$$

and, under $H_0^{NPar}(F_0)$,

$$C\hat T(h, J) = n^{-1} \sum_{i=1}^{n} C M_i^0(Y_i) + o_p(n^{-1/2}). \qquad (28)$$

The regression rank score process (17) has the asymptotic representation

$$\sqrt{n}\,\big(\hat\beta \circ \hat H^{-1}(u) - \beta \circ H^{-1}(u)\big)
  = W_n \circ H^{-1}(u) - V_n \circ H^{-1}(u) + o_p^*(1),$$

where $o_p^*(1)$ denotes approximation uniformly in $u$ for $u$ in compact
subsets of $(0,1)$ and the empirical processes $W_n$ and $V_n$ are given by
(24) and (26).


(27) and (28) imply multivariate asymptotic normality with an asymptotic
covariance matrix that under $H_0^{NPar}(F_0)$ has a simplified structure
that follows from the covariance function $K(s,t)$ of $W_n$:

$$K(s,t) = D_s^{-1} A(s,t) D_t^{-1}, \qquad (29)$$
$$A(s,t) = X' \operatorname{diag}\big(F_0[x_i'\beta(s \wedge t)] - F_0[x_i'\beta(s)]\, F_0[x_i'\beta(t)]\big) X. \qquad (30)$$


Corollary 1 Under the assumptions of the theorem, $\hat T(h, J)$ asymptotically
has a multivariate normal distribution. Under $H_0^{NPar}(F_0)$ the covariance
matrix of $C\hat T(h, J)$ may be estimated consistently by


Cov(CT(h, J)) CIS writing R (Yni Yess IC 


i,j 


where $\hat K(s,t)$ is obtained from (23), (29), (30), replacing $\beta(t)$ with $\hat\beta(t)$.
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5 Extension to nonlinear models and application 
to ROC analysis 


In this section we briefly outline the extension of our method to nonlin- 
ear models and its application to the analysis of ROC (receiver operating 
characteristic) curves when covariables have to be taken into account. 

Transformation models and ROC models are in close correspondence 
because of their invariance. For a given statistical model 


$$Y_i \sim F_i,$$
the matrix p(F) = {p;i;} of ROC functions 


extracts the ordinal invariant part of information contained in that model. 
As at the level of statistics the vector R(Y) of ranks is maximally invariant, 
at the level of functionals, the matrix of ROC functions has this property: 
Any ordinal invariant functional T*(F) may be written as a functional of 
p(F). 

Parametric ROC models correspond to semiparametric transformation 
models. For example, the parametric ROC model 


$$\rho_{ij}(u) = F_0(F_0^{-1}(u) + \mu_j - \mu_i)$$


corresponds to $M^{SPar}(F_0)$ (3).
On the other hand, starting instead of (3) with the heteroscedastic trans- 
formation model 


t) — u; À l 
Yi ~ R(W=*), Hi = P Xi, 0i = Y Xi, (31) 


4 


we arrive at the ROC model 


$$\rho_{ij}(u) = F_0\Big(\frac{\sigma_j}{\sigma_i}\, F_0^{-1}(u) + \frac{\mu_j - \mu_i}{\sigma_i}\Big), \qquad (32)$$


which e.g. was used as starting point in Hsieh (1996) (for Fo the extreme 
value c.d.f.). 
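The step from a location-scale transformation model to a ROC model of this kind can be checked by a short calculation. The sketch below rests on two assumptions that are ours and are not stated in this form in the text: that model (31) can be read as $P(Y_i \le t) = F_0((\tau(t) - \mu_i)/\sigma_i)$ for some increasing transformation $\tau$ (a hypothetical symbol), and that the ROC function is $\rho_{ij} = F_i \circ F_j^{-1}$.

```latex
\begin{align*}
F_i(t) &= F_0\!\left(\frac{\tau(t) - \mu_i}{\sigma_i}\right), \\
F_j^{-1}(u) &= \tau^{-1}\!\left(\mu_j + \sigma_j F_0^{-1}(u)\right), \\
\rho_{ij}(u) = F_i \circ F_j^{-1}(u)
  &= F_0\!\left(\frac{\mu_j + \sigma_j F_0^{-1}(u) - \mu_i}{\sigma_i}\right)
   = F_0\!\left(\frac{\sigma_j}{\sigma_i} F_0^{-1}(u) + \frac{\mu_j - \mu_i}{\sigma_i}\right).
\end{align*}
```

Under these assumptions the transformation $\tau$ cancels, which is the invariance property discussed above, and setting $\sigma_i \equiv 1$ gives back the homoscedastic ROC model $F_0(F_0^{-1}(u) + \mu_j - \mu_i)$.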

Tosteson and Begg (1988) proposed to analyze the discrete-response
version of (31), respectively (32), with the PLUM program of McCullagh (they
assumed $F_0$ to be the logistic c.d.f.). Considering the nonparametric version
of (31) ($\beta$ and $\gamma$ depending on $t$) leads us to the nonlinear nonparametric
transformation model


$$P(Y_i \le t) = F_0 \circ g(x_i, \vartheta(t)), \qquad \vartheta: \mathbb{R} \to \Theta \ \text{unknown, smooth},$$
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where 
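$$g: \mathcal{X} \times \Theta \to \mathbb{R}$$


is a known, smooth second link function and $\Theta$ a Euclidean parameter
space. In model (31) we would use $g(x_i, \beta, \gamma) = \beta' x_i / \gamma' x_i$. The
corresponding family of estimating equations is


$$\sum_{j=1}^{n} F_0 \circ g(x_j, \hat\vartheta(t))\, x_j = \sum_{j=1}^{n} \varepsilon(t - Y_j)\, x_j,$$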


which however in general is not the gradient condition of a convex function, 
hence some additional assumptions are necessary to guarantee consistent 
estimators. 

In an analogous manner, we may define the nonlinear regression rank
score process $u \mapsto n^{1/2}(\hat\vartheta \circ \hat H^{-1}(u) - \vartheta \circ H^{-1}(u))$, nonlinear regression
rank statistics $\hat T_g(h, J) = \int g(x, \hat\vartheta(t))\, dJ \circ \hat H(t)$, corresponding to
functionals $T_g(h, J) = \int g(x, \vartheta(t))\, dJ \circ H(t)$, hypotheses $H_0: g^*(x, \vartheta(t)) = 0$
($g^*$ another known link function) and test statistics based on
$\hat T^*(h, J) = \int g^*(x, \hat\vartheta(t))\, dJ \circ \hat H(t)$.
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Abstract: Unimodality, in its weaker and stronger forms, enters the ro- 
bustness investigations somehow less often than symmetry. We point out 
how unimodality affects the asymptotics of M-estimators under heteroge- 
neous (“non-i.i.d.”) errors. Sufficient conditions are given for consistency, 
with rates, of M-estimators in unimodal heterogeneous location models. 
For heteroscedastic models, a particular case of heterogeneous ones, a 
necessary and sufficient consistency condition, with rates, is provided for 
the $L_1$ estimator, the sample median.
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1 Introduction 


In robustness theory, the assumption of symmetry is adopted quite regu- 
larly, although with a bit of strange taste: as pointed out by Huber (1981, 
page 95), “a restriction to exactly symmetric distributions ... violates the 
very spirit of robustness”—since it is not stable under small perturbations 
of the underlying probabilities. On the other hand, the symmetry assump- 
tion resolves a dilemma of estimands—the problem of finding the target of 
location estimation. For symmetric population distributions, the center of 
symmetry is widely accepted as the “natural” location parameter—see, for 
instance, Hoaglin, Mosteller and Tukey (1983, chapter 9). And, needless to
say, symmetry considerably simplifies a number of technical considerations.
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Unimodality, in its weaker and stronger forms, enters the robustness in- 
vestigations somehow less often. It shares the instability of the symmetry— 
small perturbation (in weak topology sense) of a unimodal probability can 
result in a non-unimodal one. However, in terms of “realism”, unimodality 
performs better; recall only that almost all parent distributions involved 
in parametric models are unimodal (compared to a considerable number of 
asymmetric ones). And, even in the symmetric case, the center of symmetry 
frequently also is the mode. The impact of unimodality on the philosophy of 
estimation is perhaps not that unambiguous: it could be a matter of a dis- 
cussion whether the mode, instead of the center of symmetry, can fulfill the 
need for the “natural” location in models with asymmetric (but unimodal) 
parent distribution. At least, the consequences of substituting unimodality 
assumptions for the symmetry ones deserve being closely investigated. 

Skipping this problem, we shall concentrate on the technical virtues of 
unimodality. These are really rewarding. For instance, as shown in Miz- 
era (1994), unimodality is most helpful in establishing the consistency of 
redescending M-estimators. The paper of Freedman and Diaconis (1982) 
pointed out that M-estimators can be inconsistent, due to non-identifiabi- 
lity—the lack of well-defined population value—unless the score function 
is monotone or the underlying distribution is symmetric and unimodal. 
Mizera (1994) showed that unimodality ensures the uniqueness of the pop- 
ulation value also in asymmetric cases, for the majority of M-estimators 
with the non-monotone (“redescending” ) score functions used in practice. 

In this note, we point out how unimodality affects the asymptotics of
M-estimators under heterogeneous errors. The violation of the i.i.d.
assumption, the assumption which is central to most existing statistical models
(recall i.i.d. error terms in regression or i.i.d. innovation processes in time series)
can arise from contamination, population heterogeneity, uncontrollable and 
hidden confounding factors, and variations in the measurement techniques 
or environmental conditions, the characteristics of which may vary through 
time and space. Different aspects of the asymptotics of M-estimators un- 
der heterogeneity were studied in Mizera and Wellner (1996) and Hallin 
and Mizera (1996). We concentrate here on the more specific consequences 
of unimodality. Sufficient conditions for consistency, with rates, are given 
for unimodal heterogeneous location models. For heteroscedastic models, 
a necessary and sufficient consistency condition, with rates, is established 
for the L1 estimator—the sample median. This condition is considerably 
more general than an earlier one by Sen (1968). 
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2 Consistency of M-estimators in heterogeneous 
location models 


Associated with a nondecreasing score function $\psi$ we define
$$\lambda_n(\psi, t) = \frac{1}{n} \sum_{i=1}^{n} \psi(X_{ni} - t).$$


An M-estimate is defined to be a “solution” of the equation $\lambda_n(\psi, t) = 0$;
since $\lambda_n(\psi, t)$ is monotone, the values of $t$ at which $\lambda_n(\psi, t)$ crosses the zero
level constitute an interval. To avoid ambiguity, we define the M-estimate
to be the infimum of this interval:


$$T_{n,\psi} = \sup\{t : \lambda_n(\psi, t) > 0\}.$$
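As a hedged numerical illustration of this definition (it is not part of the original argument), the sketch below evaluates $\lambda_n(\psi, t)$ and locates $\sup\{t : \lambda_n(\psi, t) > 0\}$ by bisection. The function names, the tolerance and the two score functions (the sign function and a clipped linear score with a hypothetical tuning constant `k`) are our own choices.

```python
def sign(x):
    """Score function of the sample median."""
    return (x > 0) - (x < 0)


def clipped_linear(x, k=1.5):
    """Bounded, odd, nondecreasing score: x/k clipped to [-1, 1]."""
    return max(-1.0, min(1.0, x / k))


def lam(psi, data, t):
    """lambda_n(psi, t): average of psi(X_i - t) over the sample."""
    return sum(psi(x - t) for x in data) / len(data)


def m_estimate(data, psi, tol=1e-10):
    """Approximate sup{t : lambda_n(psi, t) > 0} by bisection.

    Assumes psi is nondecreasing with psi(x) > 0 for x >= 1 and
    psi(x) < 0 for x <= -1, so that the starting bracket is valid.
    """
    lo, hi = min(data) - 1.0, max(data) + 1.0  # lam(lo) > 0 > lam(hi)
    for _ in range(200):                       # hard cap on iterations
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if lam(psi, data, mid) > 0:
            lo = mid   # mid still lies in {t : lambda_n > 0}
        else:
            hi = mid   # mid is at or beyond the supremum
    return 0.5 * (lo + hi)
```

With the sign score and the sample 1, 2, 3, 4, the zero level is attained on a whole interval between 2 and 3, and the routine returns (approximately) its left endpoint 2, in line with the convention adopted above.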
A heterogeneous location model consists of 


(H1) a set of data Xn1, Xn2,...,Xnn, which can be viewed as realizations of 
independent random variables 


(H2) with distribution functions $F_{n1}, F_{n2}, \ldots, F_{nn}$, respectively,


(H3) such that $E[\psi(X_{ni} - \theta)] = 0$, for all $i = 1, 2, \ldots, n$ and for all
$n = 1, 2, \ldots$.


The framework of (H1)-(H3) reduces to the standard i.i.d. one whenever
$F_{n1} = F_{n2} = \ldots = F_{nn} = F_\theta$, where $F_\theta(x) = F(x - \theta)$. In such a case, the
Fisher consistency condition (H3) reduces to a much simpler and traditional
one involving $F_\theta$ only. However, in the heterogeneous situation we need (H3)
as the only thread connecting all $X_{ni}$’s and $F_{ni}$’s together, ensuring that
our estimation of $\theta$ makes any sense at all, that our data are not “a bizarre
melange without much statistical relevance” (Le Cam 1986, page 529).

A heterogeneous location model is called symmetric if all $F_{ni}$’s are
symmetric about $\theta$, and unimodal if all $F_{ni}$’s are unimodal with mode $\theta$,
$i = 1, 2, \ldots, n$, $n = 1, 2, \ldots$. A distribution $G$ is called unimodal (with
mode $\theta$) if it possesses a density $g$ which is nondecreasing on $(-\infty, \theta]$ and
nonincreasing on $[\theta, \infty)$.

Various other unimodality concepts could have been considered. The
weakest one, merely requiring the existence of a unique global maximum for 
the density, is too weak for most purposes. The slightly stronger definition 
formulated in terms of convexity and concavity of the distribution function 
G before and after the mode is closely related to ours—the only difference 
is that, unlike ours, it allows for an atom located at the mode. As for 
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the strong unimodality concepts (log-concavity, for instance), they are not 
needed here. 

Note that, in symmetric models, (H3) automatically holds whenever $\psi$
is odd (that is, $\psi(-x) = -\psi(x)$). This, together with a natural interest
in estimating the center of symmetry in the symmetric models, explains
the almost exclusive choice of odd score functions $\psi$ in the practice of
M-estimation. Thus, we shall assume about our score functions $\psi$ that


(P1) $\psi$ is a non-decreasing and odd function.


To ensure robustness, we adopt boundedness; since a multiple of $\psi$ yields
the same M-estimates, we set


(P2) $\psi(-\infty) = -1$, $\psi(\infty) = 1$.

To avoid pathologies, we also suppose that


(P3) the set of discontinuity points of $\psi$ is finite


and, finally, that


(P4) $\psi$ is increasing at 0: for every $\varepsilon > 0$, there is a $\delta(\varepsilon) > 0$ such that
$\psi(\varepsilon) - \psi(-\varepsilon) = 2\psi(\varepsilon) \ge 2\delta(\varepsilon)$.


We remark that both (P3) and (P4) are satisfied by all score functions 
used in practice. For unimodal distributions, (P4) can ensure identifiability 
(uniqueness of the “population value”) of the M-estimator. 

Consistency holds whenever the model is conservative: that is, whenever
the sequence of average distribution functions

$$\bar F_n = \frac{1}{n} \sum_{i=1}^{n} F_{ni}$$

is tight (weakly sequentially compact; recall that a sequence $G_n$ is tight
if and only if for any $\varepsilon > 0$ there is a $K_\varepsilon > 0$ such that $G_n(-K_\varepsilon) < \varepsilon/2$
and $G_n(K_\varepsilon) > 1 - \varepsilon/2$ for all $n$). An important special case of a conservative
model is the mixture model: the sequence $\bar F_n$ converges weakly to a (proper)
distribution function $F$. The behavior of robust estimators in mixture
models was studied by Stigler (1976).


Theorem 1 Suppose that $X_{ni}$ satisfy the assumptions (H1)-(H3) of the
heterogeneous location model; suppose further that this model is unimodal
and conservative. If (P1)-(P4) hold, then $T_{n,\psi} - \theta = o_p(1)$ as $n \to \infty$.
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Proof: See Section 3. 


Under a slightly stronger assumption about $\psi$,


(P4’) there are non-negative integers $q_1$, $q_2$ such that $q_1 + q_2 = 1$, and
functions $\psi_1$, $\psi_2$ satisfying (P1) and (P2) such that $\psi = q_1\psi_1 + q_2\psi_2$,
where $\psi_1(x) = \mathrm{sign}(x)$ and $\psi_2$ is absolutely continuous on some
interval $[-A, A]$, $A > 0$, with a derivative $\psi_2'$ satisfying $\psi_2'(x) \ge K$ for all
$x \in [-A, A]$, for some $K > 0$,


Theorem 1 can be strengthened to yield consistency rates.


Theorem 2 Suppose that $X_{ni}$ satisfy the assumptions (H1)-(H3) of the
heterogeneous location model; suppose further that this model is unimodal
and conservative. If (P1)-(P3) and (P4’) hold, then $T_{n,\psi} - \theta = O_p(n^{-1/2})$
as $n \to \infty$.


Proof: This theorem directly follows from Theorem 6 of Hallin and Miz- 
era (1996). 


Theorems 1 and 2 show that robust M-estimators behave in unimodal and 
conservative heterogeneous location models like in the i.i.d. case, where it 
can be said that they are always consistent—as soon as the corresponding 
population values are identifiable (see Huber 1981, page 54). The assump- 
tions of Theorems 1 and 2 are easily checked in the particular case of 
heteroscedastic location models: heterogeneous location models with
distribution functions satisfying

$$F_{ni}(x) = F\Big(\frac{x - \theta}{c_{ni}}\Big),$$

where $c_{n1}, c_{n2}, \ldots, c_{nn}$ are positive scaling constants and $F$ is the
distribution function of a fixed parent distribution. Note that every heteroscedastic
model with symmetric and/or unimodal $F$ is itself symmetric and/or
unimodal.

For heteroscedastic models, we are able to state necessary and sufficient
consistency conditions, with rates, for the special case of the $L_1$
estimator, the sample median. Compared to the general conditions
established in Mizera and Wellner (1996), our condition is specially tailored
for heteroscedastic models, since, under a very mild regularity condition


(S) the parent distribution admits a density $f$ which is bounded, and there
are $A > 0$, $L > 0$ such that $f(x) \ge L$ for $x \in [-A, A]$,
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which is clearly satisfied by any unimodal parent distribution with bounded 
density, it involves only the “empirical distribution” of the scaling con- 
stants. The conservativeness of the model is no longer required; our results 
hold, in particular, when some part of “probability mass” is allowed to
“escape to infinity”. Let $\Phi_c$ be the function from $(0,\infty)$ to $(0,\infty)$ defined by
$\Phi_c(x) = 1/c$ if $x \le c$ and $1/x$ if $x > c$.


Theorem 3 Let $T_{n,\psi}$ be the sample median (that is, $\psi(x) = \mathrm{sign}(x)$). If a
heteroscedastic model satisfies (S), then $T_{n,\psi} - \theta = o_p(r_n^{-1})$ if and only if

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Phi_c(r_n c_{ni}) \to \infty \quad \text{as } n \to \infty \qquad (1)$$

for any fixed $c > 0$.
Proof: See Section 3. 


Note that the choice of $c$ for $\Phi_c$ is inessential, due to the following
elementary inequality, holding for any $c < d$:

$$\Phi_c(x) \ge \Phi_d(x) \ge \frac{c}{d}\,\Phi_c(x).$$

In other words, (1) holds for all $c > 0$ as soon as it holds for one $c > 0$.
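A small numerical sketch may help in reading condition (1). It assumes that the left-hand side of (1) is $n^{-1/2} \sum_i \Phi_c(r_n c_{ni})$ with $\Phi_c(x) = 1/\max(x, c)$; the function names are ours and purely illustrative.

```python
import math


def phi(c, x):
    """Phi_c(x) = 1/c for x <= c and 1/x for x > c, i.e. 1/max(x, c)."""
    return 1.0 / max(x, c)


def condition_stat(r_n, scales, c=1.0):
    """n^{-1/2} * sum_i Phi_c(r_n * c_ni) for scaling constants c_ni."""
    n = len(scales)
    return sum(phi(c, r_n * s) for s in scales) / math.sqrt(n)


def sandwich_holds(c, d, x):
    """Check Phi_c(x) >= Phi_d(x) >= (c/d) * Phi_c(x) for c < d.

    The tiny relative slack only guards against rounding error.
    """
    return phi(c, x) >= phi(d, x) >= (c / d) * phi(c, x) * (1 - 1e-12)
```

In the homoscedastic case ($c_{ni} = 1$) the quantity equals $\sqrt{n}/r_n$ as soon as $r_n > c$, so under this reading it diverges exactly when $r_n = o(\sqrt{n})$: for $n = 10^4$ it is about 10 for $r_n = n^{1/4}$ but only about 1 for $r_n = n^{1/2}$.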
Condition (1) implies that

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{1}{r_n c_{ni}} \to \infty \quad \text{as } n \to \infty. \qquad (2)$$

Under a non-degeneracy assumption, together with conservativeness and
some additional regularity requirements, Sen (1968) proved an asymptotic
normality result, from which (2) follows as a necessary and sufficient
consistency condition. In fact, (1) and (2) are equivalent as soon as all $c_{ni} > c$
for some $c > 0$. For the particular case of plain $o_p(1)$ consistency ($r_n = 1$),
we obtain that (1) is equivalent to

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \Phi_c(c_{ni}) \to \infty \quad \text{as } n \to \infty, \qquad (3)$$


the condition already established in Hallin and Mizera (1996). Finally, 
Lemma 6 of Hallin and Mizera (1996) yields the following corollary. 


Theorem 4 Under the assumptions of Theorem 3, $T_{n,\psi} - \theta = O_p(s_n^{-1})$
if and only if (1) holds (for some $c > 0$) for any sequence $r_n$ such that
$r_n = o(s_n)$.


Proof: A direct consequence of Theorem 3 and Lemma 6 of Hallin and 
Mizera (1996). 
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3 Proofs 


In the proofs, we write $\theta = 0$, without loss of generality.


Lemma 1 Suppose that the sequence $\bar F_n$, $n = 1, 2, \ldots$, is tight and that the
corresponding densities $\bar f_n$ are unimodal with the common mode 0. Then,
for every $\eta > 0$, there is a $K_\eta$ such that

$$\int_{-K_\eta}^{K_\eta} \bar f_n(x)\,dx \ge 1 - \eta \qquad (4)$$
and
$$\max\{\bar f_n(-K_\eta), \bar f_n(K_\eta)\} < \eta \qquad (5)$$
for all $n = 1, 2, \ldots$.


Proof: First note that (4) is just the tightness condition rewritten in terms
of densities. Turning to (5), assume that it does not hold: then, there is an
$\eta > 0$ such that for all $K$ there is an $n$ with


either $\bar f_n(-K) \ge \eta$ or $\bar f_n(K) \ge \eta$. (6)


Set $K = 2/\eta$ and suppose that $\bar f_n(-K) \ge \eta$, say. By unimodality,

$$1 \ge \int_{-K}^{0} \bar f_n(x)\,dx \ge \int_{-K}^{0} \eta\,dx = 2,$$

a contradiction; the other case in (6) is treated analogously. At this
point, we could possibly have one value of $K$ for (4) and another one for
(5), but the maximum of the two works for both.


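Proof of Theorem 1: For any $\varepsilon > 0$, let
$$a_n(\psi, \varepsilon) = E\lambda_n(\psi, \theta - \varepsilon) = \frac{1}{n} \sum_{i=1}^{n} \int \psi(x - \theta + \varepsilon)\,dF_{ni}(x)$$

and

$$b_n(\psi, \varepsilon) = E\lambda_n(\psi, \theta + \varepsilon) = \frac{1}{n} \sum_{i=1}^{n} \int \psi(x - \theta - \varepsilon)\,dF_{ni}(x).$$

In view of Theorem 1 of Hallin and Mizera (1996), it is sufficient to show
that for any $\varepsilon > 0$, $a_n(\psi, \varepsilon)$ and $b_n(\psi, \varepsilon)$ are bounded away from zero for
sufficiently large $n$. We give a proof for $a_n(\psi, \varepsilon)$; the proof for $b_n(\psi, \varepsilon)$ is
entirely similar. In view of (H3), it is sufficient to show that, for any $\varepsilon > 0$
(setting again $\theta = 0$),

$$\int [\psi(x + \varepsilon) - \psi(x)]\,\bar f_n(x)\,dx > 0 \qquad (7)$$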
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for sufficiently large n (note that, due to monotonicity, the integrand in (7) 
is non-negative). 

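Fix $\varepsilon > 0$. By (P4), there is a $\delta(\varepsilon/2)$ such that $\psi(\varepsilon/2) \ge \delta(\varepsilon/2)$; hence
we have, for all $x \in [-\varepsilon/2, 0]$,

$$\psi(x + \varepsilon) - \psi(x) \ge \psi(x + \varepsilon) \ge \psi(\tfrac12\varepsilon) \ge \delta(\tfrac12\varepsilon). \qquad (8)$$
Now, choose $\eta$ and $C > 0$ such that for $K_\eta$ given by Lemma 1 we have
$$\eta + CK_\eta \le \min\{\tfrac13\delta(\tfrac12\varepsilon), \tfrac13\}.$$
If $\bar f_n(-\varepsilon/2) \ge C$, we have by (8)
$$\int [\psi(x + \varepsilon) - \psi(x)]\,\bar f_n(x)\,dx \ge \int_{-\varepsilon/2}^{0} \delta(\tfrac12\varepsilon)\,\bar f_n(x)\,dx \ge \tfrac12\varepsilon C\delta(\tfrac12\varepsilon),$$

due to unimodality. If $\bar f_n(-\varepsilon/2) < C$, Lemma 1 gives

$$\int_{-\infty}^{-\varepsilon/2} \bar f_n(x)\,dx = \int_{-\infty}^{-K_\eta} \bar f_n(x)\,dx + \int_{-K_\eta}^{-\varepsilon/2} \bar f_n(x)\,dx \le \eta + (K_\eta - \tfrac12\varepsilon)C \le \eta + CK_\eta$$

due to unimodality again; hence,

$$\int_{-\varepsilon/2}^{\infty} \bar f_n(x)\,dx \ge 1 - \eta - CK_\eta$$

and, consequently,
$$\begin{aligned}
\int [\psi(x + \varepsilon) - \psi(x)]\,\bar f_n(x)\,dx &= \int \psi(x + \varepsilon)\,\bar f_n(x)\,dx \\
&= \int_{-\infty}^{-\varepsilon/2} \psi(x + \varepsilon)\,\bar f_n(x)\,dx + \int_{-\varepsilon/2}^{\infty} \psi(x + \varepsilon)\,\bar f_n(x)\,dx \\
&\ge \int_{-\infty}^{-\varepsilon/2} (-1)\,\bar f_n(x)\,dx + \int_{-\varepsilon/2}^{\infty} \delta(\tfrac12\varepsilon)\,\bar f_n(x)\,dx \\
&\ge -\eta - CK_\eta + \delta(\tfrac12\varepsilon)(1 - \eta - CK_\eta) \\
&\ge -\tfrac13\delta(\tfrac12\varepsilon) + \tfrac23\delta(\tfrac12\varepsilon) = \tfrac13\delta(\tfrac12\varepsilon).
\end{aligned}$$

In both cases, we have that (7) is bounded from below by
$$\min\{\tfrac12\varepsilon C\delta(\tfrac12\varepsilon), \tfrac13\delta(\tfrac12\varepsilon)\} > 0,$$

which proves the statement.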
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Proof of Theorem 3: Let y(x) = sign(z), let f be a density of the parent 
distribution of a heteroscedastic model satisfying (S). Then, f is bounded 
by K; without loss of generality we may suppose that K > 1. 

Necessity. By Theorem 3 of Hallin and Mizera (1996), $o_p(r_n^{-1})$ consistency
implies that $\sqrt{n}\,a_n(\psi, r_n^{-1}) \to \infty$ as $n \to \infty$. Proceeding similarly as
in the proof of Theorem 8 of Hallin and Mizera (1996), we obtain


vVnan(p, ra ) = a 3 f(y)dy 
n Vn 4 J-(rneni) 


IA 


> » i fy)dyt+ $> [ od 


1 
nni TnCni> 1l (ina) 
1 
DR O De 
e TnCni>l Mnni 


1 
K—=Y 1(rreni), 
a ug 


and (1) follows. 
Sufficiency. Let (S) hold with $A$ and $L$; choose $T$ and $c$ such that $T = Ac$.
If $0 < \varepsilon < T$, then, as in the proof of Theorem 7 of Hallin and Mizera (1996),


oer) = = f(y)dy 


IV 


IV 


| 


elfami) A 

1 0 3 f yd 
= + 
4 2» Pell + o rn€ni>c E(TnCni)~ J á 
1 
= y) dy + [ y)d 

1 1 1 
Lez Pe re i pas 1] 

12 
Lez 2 PelTnCni). 


An entirely similar argument for $b_n(\psi, \varepsilon r_n^{-1})$ and the subsequent application
(assuming (1)) of Theorems 2 and 3 of Hallin and Mizera (1996) conclude
the proof.


Marc Hallin and Ivan Mizera


Acknowledgements 


For the first author the research was supported by the Fonds d’Encourage- 
ment à la Recherche de l’Université Libre de Bruxelles and the European 
Human Capital contract ERB CT CHRX 940963 and for the second au- 
thor, the research was supported by the Fonds National de la Recherche 
Scientifique, the Banque Nationale de Belgique, and Slovak GAS grant 
1/1489/94. 


References 


[1] Freedman, D. A. and Diaconis, P. (1982). On inconsistent M- 
estimators. Ann. Statist. 10, 454-461. 

[2] Hallin, M. and Mizera, I. (1996). Sample heterogeneity and the
asymptotics of M-estimators. Preprint IS-P 1996-15 (No. 49), Institut de
Statistique de l’Université Libre de Bruxelles, Brussels.

[3] Hoaglin D. C., Mosteller F. M., and Tukey, J. W. (1983). Understanding 
Robust and Exploratory Data Analysis. New York: Wiley. 

[4] Huber, P. J. (1981). Robust Statistics. New York: Wiley. 

[5] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. 
New York: Wiley. 

[6] Mizera, I. (1994). On consistent M-estimators: tuning constants,
unimodality and breakdown. Kybernetika 30, 289-300.

[7] Mizera, I. and Wellner, J. A. (1996). Necessary and sufficient condi- 
tions for the consistency of the sample median of independent but not 
identically distributed random variables. Preprint IS-P 1996-6 (No. 40), 
Institut de Statistique de l'Université Libre de Bruxelles, Brussels. 

[8] Sen, P. K. (1968). Asymptotic normality of sample quantiles for
m-dependent processes. Ann. Math. Statist. 39, 1724-1730.

[9] Stigler, S. M. (1976). The effect of sample heterogeneity on linear
functions of order statistics, with applications to robust estimation.
J. Amer. Statist. Assoc. 71, 956-960.


$L_1$-Statistical Procedures and Related Topics
IMS Lecture Notes — Monograph Series (1997) Volume 31 


$L_1$-test procedures for detection of change
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Charles University, Prague, Czech Republic 


Abstract: $L_1$-type test procedures for detection of a change in linear
models are proposed, and their properties are studied under the null hypothesis
(no change).


Key words: $L_1$-test procedures, linear models, change point.


AMS subject classification: 62G20, 62E20. 


1 Introduction 


The problem of detecting and identifying changes in statistical models has
attracted a number of researchers in the last two decades. Using various
principles, they have proposed a number of statistical procedures that
are sensitive to changes, have studied their (mostly limit) properties
and have also applied them to real data sets.

The problem of detection and identification of changes in statistical
models is known as the change point problem (mostly for the case of changes in
location models), the disorder problem or testing the constancy of a regression
relationship over time. These problems arise in a number of applications
(economic modelling, quality control, biology, medicine, meteorology and
ecology, among others).

We shall consider here the following regression model with a possible
change after an unknown time point $m$:


$$Y_i = x_i^T \beta + x_i^T \delta_n I\{i > m\} + e_i, \qquad i = 1, \ldots, n, \qquad (1)$$


where $x_i = (x_{i1}, \ldots, x_{ip})^T$, $x_{i1} = 1$, $i = 1, \ldots, n$, are known regression
vectors, $m\,(\le n)$, $\beta = (\beta_1, \ldots, \beta_p)^T$, $\delta_n = (\delta_{n1}, \ldots, \delta_{np})^T$ are unknown parameters,
$e_1, \ldots, e_n$ are i.i.d. random variables with common distribution function
$F$. $I\{A\}$ denotes the indicator of the set $A$.
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The model corresponds to the situation when up to an unknown m the 
observations follow the regression model with the regression parameter $\beta$
and then the model changes to the regression model with the regression
parameter $\beta + \delta_n$. The parameter $m$ is called the change point.

The problem of our interest is to test 


$$H_0: m = n \quad \text{against} \quad H_1: m < n.$$
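To make the model and the two hypotheses concrete, the following sketch builds responses from a regression with one change point. It assumes the model has the form $Y_i = x_i^T\beta + x_i^T\delta\, I\{i > m\} + e_i$ with 1-based indices; the function name and argument layout are our own.

```python
def regression_with_change(x_rows, beta, delta, m, errors):
    """Return Y_1..Y_n with Y_i = x_i'beta + x_i'delta * I{i > m} + e_i.

    Indices are 1-based as in the model; m = n means "no change".
    """
    ys = []
    for i, (x, e) in enumerate(zip(x_rows, errors), start=1):
        mean = sum(xj * bj for xj, bj in zip(x, beta))
        if i > m:  # regime after the change point
            mean += sum(xj * dj for xj, dj in zip(x, delta))
        ys.append(mean + e)
    return ys
```

Calling it with `m` equal to the number of rows reproduces the null hypothesis (a single regression regime), while any smaller `m` shifts the regression parameter by `delta` from observation `m + 1` onwards.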


Authors usually apply either the likelihood ratio principle or the
Bayesian approach. The first principle leads to max-type procedures, the
other to sum-type procedures.

First, we shall describe likelihood ratio and related procedures when the
distribution $F$ of the error terms is $N(0, \sigma^2)$, $\sigma^2 > 0$ known. This will
motivate the development of the $L_1$-procedures.

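Assume that the $e_i$'s are i.i.d. with distribution $N(0, \sigma^2)$, $\sigma^2 > 0$ known. The
likelihood ratio principle leads to the test statistic

$$T_{n,LSE} = \max_{p \le k < n-p} \Big\{ -\sum_{i=1}^{k} \rho_{LSE}(Y_i - x_i^T \hat\beta_{k,LSE}) - \sum_{i=k+1}^{n} \rho_{LSE}(Y_i - x_i^T \hat\beta^*_{k,LSE}) + \sum_{i=1}^{n} \rho_{LSE}(Y_i - x_i^T \hat\beta_{n,LSE}) \Big\} / \sigma^2, \qquad (2)$$

where $\rho_{LSE}(x) = x^2$, $x \in \mathbb{R}^1$, $\hat\beta_{k,LSE}$ and $\hat\beta^*_{k,LSE}$ are the least squares
estimators of the regression parameters based on $X_1, \ldots, X_k$ and $X_{k+1}, \ldots, X_n$,
respectively, i.e.,

$$\hat\beta_{k,LSE} = C_k^{-1} \sum_{i=1}^{k} x_i Y_i, \qquad \hat\beta^*_{k,LSE} = C_k^{*-1} \sum_{i=k+1}^{n} x_i Y_i$$

with
$$C_k = \sum_{i=1}^{k} x_i x_i^T, \qquad C_k^* = \sum_{i=k+1}^{n} x_i x_i^T. \qquad (3)$$
The test statistic $T_{n,LSE}$ can be expressed equivalently as
$$T_{n,LSE} = \max_{p \le k < n-p} \big\{ (\hat\beta_{k,LSE} - \hat\beta^*_{k,LSE})^T C_k C_n^{-1} C_k^* (\hat\beta_{k,LSE} - \hat\beta^*_{k,LSE}) / \sigma^2 \big\} \qquad (4)$$
and
$$T_{n,LSE} = \max_{p \le k < n-p} \big\{ S_{k,LSE}^T C_k^{-1} C_n C_k^{*-1} S_{k,LSE} / \sigma^2 \big\}, \qquad (5)$$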
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where r 
$$S_{k,LSE} = \sum_{i=1}^{k} x_i (Y_i - x_i^T \hat\beta_{n,LSE}). \qquad (6)$$
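The equivalence of the three expressions can be checked numerically in the simplest case. The sketch below assumes the location model ($p = 1$, $x_i = 1$, so that $C_k = k$, $C_k^* = n - k$ and $C_n = n$) and reads (2), (4) and (5) as the likelihood ratio, Wald and score forms of the same quantity for a fixed $k$; this reading and all function names are our assumptions.

```python
def mean(v):
    return sum(v) / len(v)


def rss(v):
    """Residual sum of squares around the sample mean."""
    m = mean(v)
    return sum((y - m) ** 2 for y in v)


def lr_form(y, k, sigma2=1.0):
    """Drop in residual sum of squares when splitting after k, as in (2)."""
    return (rss(y) - rss(y[:k]) - rss(y[k:])) / sigma2


def wald_form(y, k, sigma2=1.0):
    """Quadratic form in the difference of the two estimators, as in (4)."""
    n = len(y)
    diff = mean(y[:k]) - mean(y[k:])
    return diff ** 2 * k * (n - k) / n / sigma2


def score_form(y, k, sigma2=1.0):
    """Quadratic form in the partial sum of residuals S_k, as in (5)."""
    n = len(y)
    m = mean(y)
    s_k = sum(yi - m for yi in y[:k])
    return s_k ** 2 * n / (k * (n - k)) / sigma2
```

For the sample 1, 2, 3, 10, 11 and $k = 3$ all three functions return 86.7 (up to rounding), and maximizing any of them over $k$ gives the same value of the test statistic.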


Horváth (1995), among others, derived the limit distribution of $T_{n,LSE}$ under
$H_0$ and showed that if mild assumptions are satisfied then under $H_0$, as
$n \to \infty$,

$$\max\{ S_{k,LSE}^T C_k^{-1} C_n C_k^{*-1} S_{k,LSE}/\sigma^2;\ k = [n/\log n], \ldots, n - [n/\log n] \}\,/\,T_{n,LSE} \to 0,$$
which means that asymptotically, even under $H_0$, the terms with $k$ “small” or
close to $n$ dominate the others. To avoid this unpleasant property some
modifications were proposed. Namely, the class of test statistics depending
on a suitable weight function $q$ was introduced:


Sk rSECn Sk,LSE 
Tn,tse(q) = Max e] 


(7) 
Typical choices of the weight function q are the following 
g(t) = (E-t)? tE (a1, 2) (8) 


q(t) = 0 otherwise, 


where 0 < aj < ag < 1, or 
q(t) = (t(1-¢))"% = te (0,1/2), (9) 


with y € [0, 1/2). 

Some authors (e.g. Jandhyala and MacNeill, 1989; Ploberger and Krämer,
1992) suggested applying procedures based on properly standardized partial
sums of the LSE-residuals:

$$S^0_{k,LSE} = \sum_{i=1}^{k} (Y_i - x_i^T \hat\beta_{n,LSE}), \qquad k = 1, 2, \ldots, n. \qquad (10)$$
They proposed a computationally feasible procedure: 


52 
Tp,LsE(4) = max, (ee (11) 


where q is a weight function. Another type of procedures is based on moving 
sums (MOSUM) of the LSE-residuals. They are defined by: 


1 oO Oo 
Ta LSE(G) = Max (ql Skuse = Sk-G LSE|\/ o \ (12) 


G<k<n 
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and 


T = max 
n, LSE — Qek<n-G 


1 O oO 
(plese — 28% LSE + Sk-G LSE|/T \ (13) 


Bayesian type of test statistics have the form 


a 1/2 
Tr'rse(v) = >. v(k/n){ Sirgen Sr rse/0)} (14) 
k=l 


where vu(1/n),...,u((n — 1)/n) represent priors. 

Although the procedures were developed for normally distributed random
errors, they can also be applied to nonnormally distributed random
errors with zero mean and finite absolute moment of order $2 + \Delta$ ($\Delta > 0$)
(in some cases a finite second moment suffices).

If $\sigma^2$ is unknown it is recommended to estimate it by


k n 
Ô2 LSE = min D = xi Bese) a ` (y= x? Bk se)? }- 
i=] i=k+1 
and plug into the above statistics. 

Typically, large values of the introduced test statistics indicate that the
null hypothesis $H_0$ fails. The exact distributions of the above test
statistics are unknown, even under $H_0$. The limit distributions were
derived under mild assumptions on the distribution of the error terms $e_i$
and on the design points $x_1, \ldots, x_n$; they make it possible to approximate
the critical values.

The test procedures corresponding to $T_{n,LSE}$ and $T_{n,LSE}(q)$ were studied
by a number of authors, e.g. Quandt (1958, 1960), Worsley (1983),
Kim and Siegmund (1989), Gombay and Horváth (1994), Horváth (1995),
Antoch and Hušková (1992). Well known is the paper by Brown, Durbin
and Evans (1975) devoted to the procedures based on recursive residuals.
Bayesian type procedures were proposed and studied by Broemeling and
Tsurumi (1987) and Jandhyala and MacNeill (1989, 1991, 1992). Procedures
based on partial sums of LSE-residuals were investigated, e.g., by
Jandhyala and MacNeill (1989) and Ploberger and Krämer (1992). Hackl
(1980) studied procedures based on moving sums (MOSUM) in depth. A
number of applications in econometrics are contained in Hackl (1989) and
Hackl and Westlund (1991). Horváth, Hušková and Serbinowska (1995)
considered the case when the change can occur in the regression parameters
and/or in the scale $\sigma$.

Along the same lines, M-type and R-type (rank-based) tests were developed
and studied. Some results on the M-procedures for changes in regression
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models can be found, e.g., in Sen (1984) and Hušková (1990a,b, 1994 a,b). 
R-type test procedures were studied by Sen (1980, 1982) and Hušková
(1994b).


2 $L_1$ procedures


It is easily seen that the test statistics $T_{n,LSE}$ are functions of least squares
estimators and of the LSE-residuals $Y_i - x_i^T \hat\beta_{LSE}$, where $\hat\beta_{LSE}$ is a least
squares estimator; therefore they can be viewed as $L_2$-type test
statistics.

Now, along this line the $L_1$-procedures will be developed. Namely, we
replace the least squares estimators by $L_1$ estimators, $\rho_{LSE}$ by $\rho_{L_1}(x) = |x|$,
$x \in \mathbb{R}^1$, and the LSE-residuals by $L_1$-residuals $\psi_{L_1}(Y_i - x_i^T \hat\beta_{L_1})$, where $\hat\beta_{L_1}$ is
an $L_1$ estimator of $\beta$ and $\psi_{L_1}(x) = -1$, $x < 0$, $\psi_{L_1}(x) = 0$, $x = 0$,
$\psi_{L_1}(x) = 1$, $x > 0$.

From the three equivalent expressions for $T_{n,LSE}$ ((2), (4), (5)) we get three
different test statistics. Namely,


k 
Tris = max (2F(F(1/2))(— OM —aF Ben] (15) 


1=1 


- Ð Y-a i214 01% - 278,21}, 


i=k+1 i=1 
(2) == T2 —1 * T 
Tris = max {4P (E12) (Brr — Bhra) (16) 
(C tS (Brr = Bkr )} 
and 5 
T —1 *—] 
TË) = max {Skr (C: A Sanh: (17) 


where $\hat f(F^{-1}(1/2))$ is an estimator of $f(F^{-1}(1/2))$, $F^{-1}$ and $f$ denote the
quantile function and the density, respectively,

$$S_{k,L_1} = \sum_{i=1}^{k} x_i\, \psi_{L_1}(Y_i - x_i^T \hat\beta_{n,L_1}), \qquad (18)$$
and $\hat\beta_{k,L_1}$ and $\hat\beta^*_{k,L_1}$ are the $L_1$-estimators of the regression parameters
based on $X_1, \ldots, X_k$ and $X_{k+1}, \ldots, X_n$, respectively, i.e., they are defined as
solutions of the minimization problems
$$\min \sum_{i=1}^{k} |Y_i - x_i^T t|, \qquad t \in \mathbb{R}^p \qquad (19)$$
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and n 
$$\min \sum_{i=k+1}^{n} |Y_i - x_i^T v|, \qquad v \in \mathbb{R}^p, \qquad (20)$$
respectively.
The statistic $T^{(1)}_{n,L_1}$ is a likelihood ratio type statistic, $T^{(2)}_{n,L_1}$ is a Wald
type test statistic and $T^{(3)}_{n,L_1}$ is a score type test statistic.
Computational feasibility of the statistic $T^{(3)}_{n,L_1}$ is evident. The statistics
$T^{(1)}_{n,L_1}$ and $T^{(2)}_{n,L_1}$ depend on the estimator of $f(F^{-1}(1/2))$ and also on the
estimators $\hat\beta_{k,L_1}$ and $\hat\beta^*_{k,L_1}$, $k = 1, \ldots, n$. The quality of the estimator of
$f(F^{-1}(1/2))$ strongly influences the behavior of the test itself.

The weighted type test statistics are defined by 


T —1 
Sk L; Cr Sk, Lı \ 
? 


Tht (q) = max { q(k/n) 


(21) 
where the weight function $q$ is the same as in the LSE case.

Next, we introduce the test statistics based on partial sums of
$L_1$-residuals

$$S^0_{k,L_1} = \sum_{i=1}^{k} \psi_{L_1}(Y_i - x_i^T \hat\beta_{n,L_1}), \qquad k = 1, 2, \ldots, n. \qquad (22)$$
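For the location model the partial sums of $L_1$-residuals are easy to compute by hand, which the following sketch illustrates. It assumes $p = 1$ and $x_i = 1$, so that a sample median serves as the $L_1$ estimator, and it takes the residual score to be the sign function; the function names are ours.

```python
import statistics


def sign(x):
    return (x > 0) - (x < 0)


def l1_residual_partial_sums(y):
    """Partial sums S_k^0 of sign residuals around the sample median.

    Location model only (p = 1, x_i = 1), where an L1 estimator of the
    regression parameter is a sample median.
    """
    med = statistics.median(y)
    sums, running = [], 0
    for yi in y:
        running += sign(yi - med)
        sums.append(running)
    return sums
```

For the sample 1, 2, 3, 10, 11, 12, 13 the median is 10 and the partial sums are -1, -2, -3, -3, -2, -1, 0: the path moves furthest from zero near the level shift, which is the feature that statistics built on these sums are designed to pick up.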
We get the weighted sum type and MOSUM type test statistics 
(0) ISẸ Ly | 
mb’) = rae (Saath)? (23) 
* 1 (0) O 
nts (G) = ax {alk — St-atalf (24) 
mh = oax | alse. t 2h + R-an (25) 


where q is a weight function. 
Finally, Bayesian type of test statistics have the form 


1/2 
TB, (v - Su k/n) {2S r, CR Ski} - 


where vu(1/n), ...,u((n — 1)/n) represent priors. 

As in the $L_2$ situation, large values indicate that $H_0$ does
not hold. The exact distributions of the test statistics can hardly be
obtained, even under the null hypothesis. The limit distributions under $H_0$ can
be derived (see Theorems 1-3 below); they are useful for
approximating the critical values.
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Now, we pay attention to the limit behavior of the introduced tests 
statistics under the null hypothesis. 
We consider the assumptions: 


(i) Random variables $Y_1, \ldots, Y_n$ follow the model (1) with $n = m$ and the
distribution $F$ has median 0 and a density $f$ that is Lipschitz of order $\gamma_1 \in (0,1]$
and strictly positive at the median, i.e., $F^{-1}(1/2) = 0$, $|f(0) - f(x)| \le D|x|^{\gamma_1}$
for some $D > 0$ and all $x$ in a neighbourhood of 0 and $f(0) > 0$.

(ii) limpoo Cine} = tC, t € [0, 1], for some C > 0. 

(iii) There exist € € (0,1) and y2 > 0 such that, as n — on, 


1 
IZ C — C| = O(k-®) 


Ic - Cl = O(n) 


uniformly for 1 < k < ne, where C is the same as in (ii) and ||.|| denotes 
the Eucledian norm. 
(iv) As, n > œ, 


k n 
1 3 1 5 
m a i PI i = QO(1). 
ee Dl + eZ, lal) ms 


(v) f(0) be an estimator of f(0) such that, as n — oo, 
f(0) — f(0) = op((log logn) t?) 
Theorem 1 Let assumptions (i), (ii), (iii) and (iv) be satisfied; then


lim P(a(logn) Gee < t+ b,(logn)) = exp{—2exp{—t}}, te F}, 


n— CO 


(26) 
where
$a(y) = (2\log y)^{1/2}$, $b_p(y) = 2\log y + \frac{p}{2}\log\log y - \log(2\Gamma(p/2))$, $y > 1$,
(27)
and


$\Gamma(p/2) = \int_0^\infty t^{p/2-1} \exp\{-t\}\,dt.$


If, moreover, (v) is satisfied then the assertion (26) remains true if 
a” is replaced by ee or ae) 2. 


Theorem 2 Let assumptions (i), (ii) and (iv) be satisfied; then, as n → ∞,


(SE, BP)? 
(Tau (DP >? sup {Sf (28) 
and


2 (0) sup {BHO 


D — . Á 


TA EE > [vit Br) 2g (30) 


where {B,(t);t € (0,1)} are independent Brownian bridges and q is the 
weight function defined by either (8) or (9) and v(t) = (q(t),/t(1 —t)) t, te 
(0,1). 


Theorem 3 Let assumptions (i), (ii) and (iv) be satisfied and let, as n → ∞,


G/n — 0, Gin? logn = 0, (31) 


then 
lim P(a(log(n/G))Tx 1, (G) < t+ bi (log(n/G) +1og2) (32) 


= exp{—2exp{—t}}, t € Rt, 
lim P(a(log(n/G)) n Lı (C) < t+ bi (log(n/G) + log 3) (33) 
= exp{—2exp{—t}}, t€ R}. 


The assertions of Theorems 1-3 remain true if the L_1-test statistics are
replaced by LSE-test statistics and if in assumption (i) the requirement of zero
median is replaced by the requirement of zero mean and finite absolute moment
of order 2 + Δ, Δ > 0.

Assertions (26), (32) and (33) are extreme value type theorems. It is 
known that the convergence in (26), (32) and (33) is rather slow. 
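
As a small illustration of how the limit law in (26), (32) and (33) can be turned into approximate critical values, the sketch below inverts the distribution function exp{-2 exp{-t}} in closed form. The function names are hypothetical and introduced only for this example; the normalizing sequences a(.) and b(.) still have to be applied afterwards, and, given the slow convergence noted above, the resulting values should be read as rough approximations.

```python
import math


def limit_cdf(t: float) -> float:
    """Limit distribution function exp{-2 exp{-t}} appearing in (26), (32), (33)."""
    return math.exp(-2.0 * math.exp(-t))


def limit_quantile(alpha: float) -> float:
    """Solve exp{-2 exp{-t}} = 1 - alpha for t, where 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return -math.log(-0.5 * math.log(1.0 - alpha))


if __name__ == "__main__":
    for alpha in (0.10, 0.05, 0.01):
        t = limit_quantile(alpha)
        # The second number re-evaluates the limit law at t as a sanity check.
        print(f"alpha = {alpha:.2f}: t = {t:.4f}, limit_cdf(t) = {limit_cdf(t):.4f}")
```

The closed form follows from taking logarithms twice, so a pocket calculator is indeed sufficient.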

The explicit form of the limit distribution in (28), (29) and (30) is known 
only for some weight function q, e.g., for q in (9) with y = 0 in Sen (1981) 
and for q in (8) Siegmund (1987) derived a proper approximation. 

Approximation to the critical values corresponding to (ey ) ee ee 
Tri, and Tiz, can be easily calculated using a pocket aaa 


The tests based on either of oe Pa j = 1,2,3, T? nL, are consistent for 
fixed and as well as some local alternatives. Concerning Ta): Ta 1, (0), 
Ta L, and Tz, their limit distribution depend on 6, and the design matrix. 
This will be aed in a different paper. 

The assumptions (ii) - (iv) imposed on the design matrix are slightly
stronger than those one usually assumes when studying, for example, L_1 estimators
in the model (1) with δ_n = 0.
Concerning the estimator of f(0) we need an estimator that behaves rea-
sonably well not only under the null hypothesis but also under alternatives.
Such estimators can be described as follows:


n—m 


3| > 


f(0) = — fà (0) + f (0), 


n 
where fa (0) and f> (0) are estimators of f(0) based on Yj,...,Y= and 
Yni Yn and m is an estimator of possible change point m. There 


is a number of possibilities how to estimate f-(0), E (0) and M. Here is 
one suggestion 


M = argmazx{||Bk L = Bix, ll; k= l; era 


~ 


i ee ae ae 
fz,(0) = mA So Lx Ban -A Y?n < Y: < xG Be tAn}, 
i=1 


where 7) > 0 fixed, and i (0) is defined accordingly. Under the assumptions 
(i),(ii) and (iv) the resulting estimator f(0) has the property requested in 
the assumption (v). 


3 Proofs 


Since the proofs are quite technical we give only a sketch of them. First, we 
formulate several technical lemmas that are modifications of results proved 
elsewhere. 


Lemma 1 Let assumptions (i) - (iv) be satisfied; then for any η > 0 there
exist A_η > 0 and n_η such that for all n ≥ n_η


Ee ( sup{| > (er, (Ei — 0 ?x] t) — pr, (Ei) +07 xf tbr, (Ei — nx} t)) 
i=1 


(34) 
Orc.) lt] < D} > Ayn™) <n 
and 5 
P( sup {|J zar (Ei = nxit) — db, (Ei) (35) 
i=1 


+2f(O)n-¥*xF4)|; {It|| < D} > An?) <n, j= 1,.P, 


for some v > 0 and arbitrary D > 0. 
Proof: The proof of the first assertion is a simple modification of Lemma
1 in Gutenbrunner et al. (1993) and Theorem 1 in Hušková (1994c), while
the second assertion follows from Theorem 2 in Hušková (1994c). □


Lemma 2 Let assumptions (i) - (iv) be satisfied; then for any η > 0 there
exist A_η > 0 and n_η such that for all n ≥ n_η


ce." ak. (Ei)| | (36) 


41=1 


/ 1 
P(|IC; (Bk L =p] = FO) 


A <k,k <n, 


*1/2roe 1 ç*71/2 E;) 37 
P(IICR Bi,n, ~ 8) - Č PEG Ml (687 


> A,(n—k)-?) < (n-k), k <n, 
for some v > 0 and arbitrary D > 0. 


Proof: The proof follows the line of the proofs of Lemma 1 in Gutenbrun-
ner et al. (1993) and Theorem 4 in Hušková (1994c), and we apply Lemma 1 of
the present paper. □


Lemma 3 Let assumptions (i) - (iv) be satisfied then 


in, P(e as (Dern Eac Erne] enci 
(38) 


he (E) -aC Ern (E) < t+ Blog) 
i=1 i=] 


= exp{—2exp{—t}}, t € RÈ. 
Proof: The proof follows the line of Theorem 1.1 in Horváth (1995). □


Lemma 4 Let assumptions (i), (ii) and (iv) be satisfied; then, as n → ∞,


[nt] 


(CR? Downs (E: );t € (0,1)} =? {(W1 (t), -.., Wp(t))" ;t € (0,1)}, (39) 


where {W_1(t); t ∈ (0,1)}, ..., {W_p(t); t ∈ (0,1)} are independent standardized
Wiener processes. If, moreover, (31) is fulfilled, then


k+G 


Jim, P(a(log(n/G)){ , max | 2 tbr, (Ei)|} < t+bı(log(n/G)) +log 2) 
(40) 
= exp{-2 exp{-t}}, t ∈ R^1


and 
k+G k 
lim P(a(log(n/G)){ max 1 È vn Œ- dz, (B)I} (40 
= i=k+1 i=k—G+1 


< t +b; (log(n/G) + log 3) = exp{—2exp{—t}}, t € Rl. 


Proof: Since way i | xip, (E i) is the vector of sums of independent random 
variables with zero mean and finite third absolute moment and since (i), 
(ii) and (iv) are fulfilled the assertion (39) can be derived using standard 
arguments. The assertions (40) and (41) are proved, e.g., in Chen (1988).
□


Proof of Theorem 1: We sketch the proof for $T^{(1)}_{n,L_1}$ only; the proof for
$T^{(j)}_{n,L_1}$, j = 2, 3, is omitted because it follows the same line.
By Lemma 3 it suffices to show that $T^{(1)}_{n,L_1}$ has the same limit distri-
bution as


$\max_{1 \le k < n} L_{n,k}$


where


Lng = = (Erva OG y e a Ga 
i=1 


k n 
(Eutr (E) - Cez Sxabn, ()) 
t=] i=1 
Put 


Vne = {2F(0)( — 5 EASA 


i=1 


= 3 Y: — z; Tekn +D- Ti TBatsl) bo R= lyon — 1. 


i=k+1 
Applying Lemma 1 and Lemma 2 we get, after tedious but straightforward
calculations, that, as n → ∞,


La = = log | —1/2 42 
(log ES lee n)e | į Vnk op(( renee n) ) ( ) 


for all a > 0. Moreover, using standard tools we also obtain that


(Ln.k + Vnk) = O(log log n) (43) 


max 
1<k(log n) 
and


(Lk EZ Vnk) = Op( V log log n) (44) 


max 
n—(logn)*<k<n 
for some a > 0.


Combining (42) - (44) we get that the limit distribution of $T^{(1)}_{n,L_1}$ is the
same as that of $\max_{1 \le k < n} L_{n,k}$. The assertion (26) follows. □


Proof of Theorem 2: Using Lemma 1 and Lemma 2 we obtain that, as
n → ∞,


ax [|C a (Sk Lı — Yavu (E ))I = 0,(1) 


kn 


and 


1<k<n 


k 
max |n—/?(52 5, — Dvr, (Es) | = op(1). 
i=] 


The proof can then be finished using classical theorems on the weak con-
vergence of functionals of partial sums of independent random variables.
□


Proof of Theorem 3: By Lemma 1 and Lemma 2 we get after some
standard steps that, as n → ∞,


k+G 
eee h=” “(Ser T SRL = 2, br, (Ei))| = op((log n)~?). 
The assertions (32) and (33) then follow from (40) and (41) in Lemma 4.
□
Abstract: We show that robust M-estimators as well as equivariant esti-
mators which do not depend on the extreme observations are inadmissi-
ble estimators of the location with respect to the L_1 loss function for a
broad class of distributions. As a consequence, it implies that the sample
exponential distribution.
1 Introduction


Let X_1, ..., X_n be a random sample from a distribution with the absolutely
continuous distribution function F(x - θ), θ ∈ R^1. The problem is that of
estimating the parameter θ. We shall assume that the loss L(t, θ) incurred
when estimating θ by t depends only on |t - θ|, i.e.


L(t, θ) = L(|t - θ|). (1)


Then it is natural to restrict considerations to the estimators equivariant
with respect to the shift in location, i.e. satisfying


T_n(X_1 + c, ..., X_n + c) = T_n(X_1, ..., X_n) + c ∀c ∈ R^1 and ∀X ∈ R^n.
(2)
Let \mathcal{T} denote the family of all equivariant estimators.
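
The equivariance requirement (2) is easy to probe numerically. The following sketch checks it for the sample mean and the sample median on one fixed sample and a few shifts, and contrasts them with the squared mean, which is not equivariant. The helper `is_shift_equivariant`, the sample and the shifts are all made up for this illustration; a finite check of this kind can refute equivariance but of course cannot prove it.

```python
import statistics


def is_shift_equivariant(estimator, sample, shifts, tol=1e-9):
    """Check T(x_1 + c, ..., x_n + c) = T(x_1, ..., x_n) + c for each shift c."""
    base = estimator(sample)
    return all(
        abs(estimator([x + c for x in sample]) - (base + c)) <= tol
        for c in shifts
    )


sample = [0.3, -1.2, 2.5, 0.7, -0.4]   # illustrative data, not from the paper
shifts = [-3.0, 0.5, 10.0]

mean_ok = is_shift_equivariant(statistics.mean, sample, shifts)
median_ok = is_shift_equivariant(statistics.median, sample, shifts)
# The squared mean violates (2) and serves as a counterexample.
square_ok = is_shift_equivariant(lambda xs: statistics.mean(xs) ** 2, sample, shifts)

if __name__ == "__main__":
    print("mean:", mean_ok, "median:", median_ok, "squared mean:", square_ok)
```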
Different measures of performance of equivariant estimators were inves- 
tigated. Among them, the probability 


P_θ(|T_n - θ| > a) (3)


that the absolute error exceeds a > 0, was considered by several authors,
either for n → ∞ and a > 0 fixed or n fixed and a → ∞. Bahadur [2], [3],
Fu [4] and Sievers [12] studied the limit


1 
Jim. {5 log Fal (Tr — 0| > a)} =e, a fixed (4) 


as a measure of performance of T_n. Sievers [12], who calculated the limits
e for several estimators and several distribution shapes, found the sample
median less efficient than the sample mean not only for the normal but also
for the logistic distribution and, for sufficiently large a, even for the double
exponential distribution. A similar phenomenon was observed by Jurečková
[5], who considered the measure of performance


— log Po(|T;,| > a) 


Ble Te) = pa > a) ' 


under n fixed and a → ∞. The estimators which trim off the extreme
observations were found more robust but less efficient than \bar{X}_n for den-
sities with exponential tails, including the double exponential. Akahira
and Takeuchi [1] computed the loss of information caused by trimming the
extreme order statistics in the double exponential population.

Denote 


Y = (Y_1, ..., Y_n) where Y_i = X_i - X_1, i = 1, ..., n (6)


the maximal invariant of X with respect to the group of translations of
X_1, ..., X_n. Then the minimum risk equivariant estimator T_n^* (Pitman es-
timator, MRE) exists provided there exists at least one T_n ∈ \mathcal{T} with finite
risk; then T_n^* could be written in the form


T_n^*(X) = T_n(X) - v^*(Y) (7)
where v^*(Y) satisfies
E_0 L(|T_n(X) - v^*(Y)|) = min E_0 L(|T_n(X) - v(Y)|) (8)


with the minimum taken over all functions v(y).
If the loss function is quadratic, L_2(t, θ) = (t - θ)^2, the minimum risk
estimator has the form


T_n^*(X) = T_n(X) - E_0(T_n(X) | Y)


and it equals the sample mean \bar{X}_n provided F is normal. Conversely, Ka-
gan, Linnik and Rao [9] proved that if \bar{X}_n, n ≥ 3, is the minimum
risk estimator of θ for some F with respect to the quadratic loss, then F is
normal. In other words, the equality E_0(\bar{X}_n | Y) = 0 characterizes the
normal distribution.

If f is not normal, the minimum risk estimator of θ is typically nonlinear
and only the sample mean is a sum of independent summands. However,
many estimators admit asymptotic representations of the type


$T_n(X) = \theta + \frac{1}{n}\sum_{i=1}^{n} \psi(X_i - \theta) + o_p(n^{-1/2})$ as n → ∞ (9)


with appropriate functions ψ. In particular, if T_n(X) is an asymptotically effi-
cient estimator, then $\psi(x) = -\frac{1}{I(f)}\frac{f'(x)}{f(x)}$, with f being the density of F and
I(f) its Fisher information. A systematic study of the representations of
this type can be found in [7]. Jurečková and Milhaud [6] recently proved
that if the equality Eo} z- Y(X) |Y) = 0 holds for n ; 4 re for a func- 
tion 4% satisfying some regularity conditions, then w(x z e R}, 
where c is a constant and f is the density of F. This ula = ofa}: = hat 
not oo ayp T, but also in the finite sample case, many properties 
Obs al ae -FaN under F are in correspondence with the respective 
properties of the sample mean under the normal distribution. 

In the present paper, we shall consider the performance of some robust
estimators with respect to the L_1 loss, i.e.,


L_1(t, θ) = |t - θ|. (10)


Zinger, Kagan and Klebanov [13] and Kagan and Zinger [10] (see also [9],
Section 7.9) proved that if \bar{X}_n is the minimum risk estimator of θ with
respect to L_1-loss for f(x - θ), f unimodal and n ≥ 6, then the underlying
distribution is normal. This means that, for the normal distribution, \bar{X}_n
is the minimum risk estimator of θ with respect to both the L_1 and L_2 loss
functions, and in both cases its admissibility is a characteristic property
of the normal distribution.

If f is unknown then we prefer robust estimators which are not connected 
with a fixed density shape. However, if we know f, then we are interested in 
admissible estimators whose risk cannot be uniformly improved. We shall 
show that the robust estimators are not admissible under the L_1 norm for
a broad class of densities. 
2 Inadmissibility of trimmed estimators


Let X_1, ..., X_n be a random sample from a distribution with the continuous
density f(x - θ) such that f(x) > 0, x ∈ R^1. Let X_{n:1} ≤ ... ≤ X_{n:n} be the
order statistics corresponding to X_1, ..., X_n. Let \mathcal{T}^* ⊂ \mathcal{T} denote the set of
equivariant estimators satisfying the following conditions:


(A1) If T_n = T_n(X_{n:i_1}, ..., X_{n:i_k}), i_1 < ... < i_k, 1 ≤ k ≤ n, then
X_{n:i_1} ≤ T_n ≤ X_{n:i_k}.


(A2) T_n(0, ..., 0) = 0.
The following theorem shows that the trimmed estimators as well as the
M-estimators with a score function constant outside a bounded interval are
inadmissible for unimodal densities.


Theorem 1 Let X_1, ..., X_n, n ≥ 5, be independent observations from a
distribution with the density f(x - θ), where f(x) > 0, x ∈ R^1, is unimodal,
i.e. increasing for x < 0 and decreasing for x > 0.


(i) Let T_n ∈ \mathcal{T}^* be an equivariant estimator, T_n(X_1, ..., X_n), continuous
in each argument, constant with respect to X_{n:1}, X_{n:2}, X_{n:n-1} and X_{n:n}, but
uniquely determined as a function of X_{n:3}, ..., X_{n:n-2}. Then T_n is an inad-
missible estimator of θ with respect to the loss (10).

(ii) Let M_n be an M-estimator generated by a continuous non-decreasing
function ψ such that ψ(x) = ψ(c_1) for x ≤ c_1 and ψ(x) = ψ(c_2) for
x ≥ c_2, c_1 < 0 < c_2, as


$M_n = \frac{1}{2}(M_n^- + M_n^+)$ where (11)


$M_n^- = \sup\{t : \sum_{i=1}^{n} \psi(X_i - t) > 0\}$, $M_n^+ = \inf\{t : \sum_{i=1}^{n} \psi(X_i - t) < 0\}$.


Then M_n is inadmissible as an estimator of θ with respect to the loss (10).


Proof: In the case of the L_1 loss, (7) specializes to
T_n^* = T_n - med_0(T_n | y) (12)


where med_0(T_n | y) stands for any conditional median of T_n given the max-
imal invariant y under θ = 0. Hence, T_n is the MRE provided


med_0(T_n | y) = 0 (13)


and the median is unique.
(i) Let T_n ∈ \mathcal{T} be uniquely determined and not depend on X_{n:1}, X_{n:2},
X_{n:n-1} and X_{n:n}. Denote


Y_i = X_{n:i} - T_n, i = 1, ..., n. (14)
Then Y is the maximal invariant for the group of translations of X_1, ..., X_n.
The conditional distribution of T_n given Y = y has the density

$g(t \mid y) = \frac{\prod_{i=1}^{n} f(t + y_i)}{\int_{-\infty}^{\infty} \prod_{i=1}^{n} f(z + y_i)\,dz}$ (15)


The condition (13) can be rewritten in the following way:


$\int_{-\infty}^{0} \prod_{i=1}^{n} f(t + y_i)\,dt = \int_{0}^{\infty} \prod_{i=1}^{n} f(t + y_i)\,dt$ a.s. [F]. (16)


In view of the continuity of the density f and of the estimator T_n, the previous equa-
tion holds not only for almost all but for all y_i.
Denoting ψ(t) = sign t, t ∈ R^1, we rewrite (16) in the form


$\int_{-\infty}^{\infty} \psi(t) \prod_{i=1}^{n} f(t + y_i)\,dt = 0.$ (17)


Denote A = {y : y_3 = ... = y_{n-2} = 0, y_1 ≤ y_2 ≤ 0, y_n ≥ y_{n-1} ≥ 0}. Then
T_n(y) = 0 for y ∈ A independently of the values of y_1, y_2, y_{n-1}, y_n and (17)
takes on the form


[wl seta) flt+y2) unf +y) = 0, y eA. (18) 


— 00O 
Differentiating (18) in yv, v = 1,2,n — 1,n gives 
f VOEE TT se+w rd =0, v=1,2n-1,n. (19) 
- iV 


Integrating the left-hand side of (19) by parts for v = 1 and using (19) for 
v =2,n—1,n, we obtain 


oe FOU) gee ss 
ma) [wero TT Er oaz (20 
ares i=1,2,n—1,n F) 
If we especially take three following choices of y € A: 
Yı = y2=u <0, Yn-1 = Yn =9 
yr <= u < 0, y2 = 0, Yn-1 = 0, Yn =v > 0 
Yı = Y2 =l, Yn-1 = Yn =v > 0, 
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and subtract twice the second equality from the sum of the first and the 
third ones, we get 


oO u v 2 
[wore a z a a A z a |" p83 (t)dt=0. ey 


Because w(t) f’(t) > 0 for t Æ 0, (21) implies that 
ft+u) f(t+v) 


oS E E 22 
fa) ~ Fe oe 
holds for allt Æ 0, u < 0, v >0, and this in turn implies that 
t 


f(u) fœ) 


holds for all t ≠ 0, u, v ∈ R^1. By the Cauchy equation, the only function
satisfying (23) is either the exponential function or the constant. By the
unimodality assumption on f, for every v > 0 there exists u < 0 such that f(u) =
f(v), and then (22) leads to a constant f, which is a contradiction. Hence,
there exists at least one y* ∈ A such that either


[_Tse+unae< [Treat (24) 


[or the opposite inequality] holds for y*. Then, because of the continuity,
(24) [or the opposite inequality] holds in a neighborhood of y*, hence (16)
is not true a.s. [F] and T_n is not admissible.

(ii) Let M_n be the M-estimator defined in (11). Put Y_i = X_i - M_n, i =
1, ..., n. Then Y is the maximal invariant with respect to the group of
translations and the conditional density of M_n given Y = y has the form
(15). Analogously to part (i), M_n would be the MRE under the
condition (17). Let B = {y : y_5 = ... = y_n = 0}; proceed analogously to
(19) - (21) and take successively the following choices of y_1, ..., y_4:


/ 
yı = Y2 S C —-U, ¥3=—=Y¥4=— C1 —U, 
/ / 
yi Cı — U, Y2 = C2 +V, Y3 =C uU, Yyy =c +v, 
/ 
yi = Y2 = C2 + V, Y3 =Y4 = C2 +V, 


u,u',v, v’ > 0. Analogously as in the part (i), we arrive at the equation 


i rrflt+a -—wft+a—v) 
J. an fici —u)fla— u’) 
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F(t teatu)f(ttea+v')’ 
flez +v) f(c2 +v’) 


Vt € IR!, u,u',v,v' > 0. Quite analogously we get 


© npp ftta-uflttot) 
[ wore: Fla uflat v) 


f (t)dt=0 (25) 


f(t + cez +v)f(t+ci— u) 
f(c2 +v) f(c — u’) 


∀t ∈ R^1, u, u', v, v' > 0. Similarly as in part (i), we first conclude that
the density f should then be either exponential or constant in the tails, and
(25) finally leads to constant tails, which contradicts the
conditions imposed on f. Thus, there exists y* ∈ B, and hence also a
neighborhood of it, satisfying either (24) or the opposite inequality; we conclude
that M_n cannot be an admissible estimator of θ with respect to the L_1 loss. □


EERI = 0 (26) 


Notice that Theorem 1 covers the trimmed L-estimators, the sample
median as well as the linear combinations of several (non-extreme) sample
quantiles; it also covers the Huber estimator and the related M-estimators.
The results also imply that the sample median is not admissible even for the
double exponential distribution, for which it is the maximum likelihood esti-
mator. Similarly, while Huber's M-estimator is the MLE for the contaminated
normal distribution, it is not admissible for the same.
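
To make the definition (11) of the M-estimator concrete, here is a minimal sketch that evaluates M_n^-, M_n^+ and their midpoint by bisection for a Huber-type score function, i.e. a continuous non-decreasing ψ that is constant outside a bounded interval. The function names, the clipping constant k and the sample data are assumptions made for this illustration only; the bisection relies on the sum of scores being non-increasing in t and on ψ being positive for positive arguments and negative for negative ones.

```python
def huber_psi(x, k=1.5):
    """Huber score: identity on [-k, k], constant outside (illustrative choice of psi)."""
    return max(-k, min(k, x))


def m_estimate(sample, psi=huber_psi, iterations=200):
    """Return (M_n, M_n^-, M_n^+) following (11), located by bisection."""

    def score(t):
        return sum(psi(x - t) for x in sample)

    def boundary(strict):
        # score(lo) > 0 > score(hi) because psi has the sign of its argument.
        lo, hi = min(sample) - 1.0, max(sample) + 1.0
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            s = score(mid)
            move_right = (s > 0) if strict else (s >= 0)
            if move_right:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    m_minus = boundary(strict=True)    # sup{t : sum psi(X_i - t) > 0}
    m_plus = boundary(strict=False)    # inf{t : sum psi(X_i - t) < 0}
    return 0.5 * (m_minus + m_plus), m_minus, m_plus


if __name__ == "__main__":
    # Two well-separated clusters: the score sum vanishes on a whole interval.
    print(m_estimate([-10.0, -10.0, 10.0, 10.0], psi=lambda x: huber_psi(x, 1.0)))
    # A tight sample: the estimate coincides with the sample mean.
    print(m_estimate([0.1, 0.2, 0.3]))
```

The first call shows why (11) averages M_n^- and M_n^+: when the score sum is zero on an interval, the midpoint picks a single value.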
Abstract: The paper discusses the behavior of residuals from least-absolute-
deviations (or L^1) fits of linear models. Particular emphasis is given to
data arising by way of designed experiments. The paper argues that the
L^1 method of fitting such models should be discouraged. The method is
inefficient when compared to other robust methods while not being any
simpler to compute. The residuals obtained by L^1 fitting exhibit several
weaknesses. First of all, they are ambiguous in the sense that there is
a multitude of L^1 fits, sometimes quite far apart. Second, typical algo-
rithms produce as many exact zero residuals as there are contrasts fitted.
As a result, the nonzero residuals do not give an accurate reflection of
the errors that occurred during the experimental runs.


Key words: Least absolute deviations, factorial designs, outlier detec-
tion, uniqueness of fit.
1 Introduction


Let y = Xθ + ε be a linear model with uncorrelated, centered, and ho-
moskedastic errors ε_1, ..., ε_n. As indicated, we take n to be the number
of observations, whereas p denotes the dimension of the regression param-
eter θ. The least-squares residuals are r = (I - H)y, where ŷ = Hy =
X(X^T X)^{-1} X^T y is the least-squares fit. If the error distribution has two
moments, it follows that E(r) = Xθ - HXθ = 0, and Var(r) = σ^2 (I - H).
The use of such residuals for outlier detection and other diagnostic pur-
poses has been explored in great detail in the statistical literature (see for
example Belsley, Kuh and Welsch, 1980; Cook and Weisberg, 1982).
Residuals from an L^1 fit are not so easily described. Throughout this
article, we denote by θ̃ a parameter fit obtained by minimizing the sum of
absolute deviations and designate by e the corresponding residuals e =
y - Xθ̃. Because R^n equipped with the L^1 norm is only a weakly convex
normed linear space, the best approximation to y of the form Xθ is in
general not unique. In fact, the best approximations form a convex set
in the p-dimensional column space of X. It is widely known that among
the best approximations there is always one for which at least p of the
components of e are exactly equal to zero. The algorithms based on linear
programming techniques always identify one of these solutions, because
they correspond to extremal points of the linear programming problem.
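
The non-uniqueness and the exact zeros can already be seen in the simplest case p = 1, a pure location model, which is used here only as a toy illustration with made-up data. With an even number of observations, every value between the two middle observations minimizes the sum of absolute deviations, and the two extremal minimizers each produce exactly one zero residual.

```python
def l1_criterion(theta, y):
    """Sum of absolute deviations of the observations y from the location theta."""
    return sum(abs(v - theta) for v in y)


y = [1.0, 2.0, 4.0, 7.0]                      # illustrative data
candidates = [2.0, 2.5, 3.0, 3.5, 4.0]        # points between the middle observations
values = [l1_criterion(t, y) for t in candidates]

# Number of exactly zero residuals at the two extremal minimizers.
zeros_at_extremes = [sum(1 for v in y if v == t) for t in (2.0, 4.0)]

if __name__ == "__main__":
    print("criterion on [2, 4]:", values)
    print("criterion outside:", l1_criterion(1.5, y), l1_criterion(5.0, y))
    print("exact zeros at the extremal fits:", zeros_at_extremes)
```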

For Gaussian errors, the least-squares fit θ̂ is fully efficient, whereas
the least-absolute-deviations fit θ̃ reaches an asymptotic efficiency of 2/π =
63.7%. In balanced factorial models the element h_{ij} of H, proportional to
the covariance of the least-squares fit ŷ_i and y_j, depends in a simple way on
the factor settings at runs i and j. In the case of a two-way ANOVA with
factors f_1 and f_2, for example, there are four cases, distinguished by the
comparison of (f_{1i}, f_{2i}) and (f_{1j}, f_{2j}). In particular, the diagonal elements
h_{ii} are all equal to p/n, where p is the dimension of the column space of
X. In the following, we restrict our discussion to this case of a balanced
factorial model.
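
The efficiency of 2/π quoted for Gaussian errors can be recovered from the standard asymptotic covariance matrices of the two fits. The short derivation below is a sketch under the assumption of Gaussian errors with variance σ^2 and density f, so that f(0) = 1/(σ\sqrt{2π}); the symbols V_{LS} and V_{LAD} are introduced only for this illustration.

```latex
\[
V_{LS} = \sigma^2 (X^T X)^{-1}, \qquad
V_{LAD} = \frac{1}{4 f(0)^2}\,(X^T X)^{-1},
\]
\[
\text{efficiency} \;=\; \frac{\sigma^2}{1/\bigl(4 f(0)^2\bigr)}
\;=\; 4 f(0)^2 \sigma^2
\;=\; \frac{4\sigma^2}{2\pi\sigma^2}
\;=\; \frac{2}{\pi} \approx 0.637 .
\]
```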

The asymptotic behavior of the residuals is the same for L^1 and L^2,
as long as p/n tends to zero with increasing n. The residuals are asymp-
totically equivalent to a sample from the error distribution. Asymptotic
considerations are, however, of minor interest when one discusses proper- 
ties of residuals. The case p + n is of much more practical concern. 


2 Identifying a small flock of outliers 


Least-squares residuals have a tendency to behave much like a sample from
a Gaussian distribution. Stem-and-leaf plots or normal plots often do not
reveal anything of interest. This is due to the dependence imposed on the
residuals by the requirement that r^T X = 0. For that reason, glaring error
structures will be lost or not faithfully translated into residual structures.
An example of this sort concerns the presence of a few outliers among the
measurement errors. To illustrate what happens, suppose the residual r has
a fixed size, e.g. r^T r = 1, and we seek to maximize w^T r for a fixed vector
of component weights w. The solution to this constrained optimisation
problem yields a maximal value of sqrt(w^T w - w^T H w), achieved when r ∝
(I - H)w. Thus, if we maximize a single component of a (unit) least-
squares residual r, the largest possible value is sqrt(1 - h_{ii}) = sqrt((n - p)/n). If
we maximize the sum of two components, the largest value for the sum of
the ith and jth residuals is sqrt(2 - h_{ii} - h_{jj} - 2h_{ij}), etc.
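
The maximal value can be verified with the Cauchy-Schwarz inequality. The following sketch uses only the facts that H is a symmetric projection and that every least-squares residual satisfies Hr = 0, i.e. r = (I - H)r; the unit vectors e_i are notation introduced for this illustration.

```latex
\[
w^T r \;=\; w^T (I-H)\, r \;=\; \bigl((I-H)w\bigr)^T r
\;\le\; \|(I-H)w\| \,\|r\|
\;=\; \sqrt{w^T (I-H) w}
\;=\; \sqrt{w^T w - w^T H w}
\qquad (r^T r = 1),
\]
with equality when $r \propto (I-H)w$. Taking $w = e_i$ gives $\sqrt{1 - h_{ii}}$,
and taking $w = e_i + e_j$ gives $\sqrt{2 - h_{ii} - h_{jj} - 2h_{ij}}$.
```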
Example 1 A simple illustration of these facts can be given by using the 2^2
main effects model. If all observations are zero except one, which is equal
to c, the residuals are equal to ±c/4. This residual vector has L^2-norm r^T r
equal to c^2/4 and c = 2 normalizes it. The largest possible component of
a unit residual is, therefore, equal to 1/2 = sqrt((4 - 3)/4), which means that
the largest percentage of the L^2-norm of a residual that resides on a single
component is 25%.
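
The numbers in Example 1 can be reproduced with exact rational arithmetic. The sketch below builds the hat matrix of the 2^2 main effects model from ±1 coded columns, which are mutually orthogonal, so that H is a sum of rank-one projections. The function names and the coding of the design are choices made for this illustration.

```python
from fractions import Fraction


def hat_matrix(columns):
    """H = sum_j x_j x_j^T / (x_j^T x_j); valid only for mutually orthogonal columns."""
    n = len(columns[0])
    H = [[Fraction(0)] * n for _ in range(n)]
    for col in columns:
        norm2 = sum(c * c for c in col)
        for a in range(n):
            for b in range(n):
                H[a][b] += Fraction(col[a] * col[b], norm2)
    return H


def ls_residuals(H, y):
    """Least-squares residuals r = (I - H) y."""
    n = len(y)
    return [y[a] - sum(H[a][b] * y[b] for b in range(n)) for a in range(n)]


# 2^2 main effects model: intercept, factor A, factor B (coded -1 / +1).
cells = [(a, b) for a in (-1, 1) for b in (-1, 1)]
columns = [[1] * 4, [a for a, _ in cells], [b for _, b in cells]]
H = hat_matrix(columns)

c = 2
r = ls_residuals(H, [Fraction(c), Fraction(0), Fraction(0), Fraction(0)])
squared_norm = sum(v * v for v in r)
largest_share = max(v * v for v in r) / squared_norm

if __name__ == "__main__":
    print("h_11 =", H[0][0])
    print("residuals:", r)
    print("r^T r =", squared_norm, " largest share =", largest_share)
```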


In a 3 x 3 main effects model, the corresponding number is 2/3 =
sqrt((9 - 5)/9), which implies that the largest percentage of the L^2-norm re-
siding in a single residual is equal to 44.4%. These examples illustrate the
fact that the ability of a design to show a single outlier by way of a large
individual residual depends on the ratio p/n.

The general formula given above can be used to judge the ability of a 
given design to point out in a single experiment two outliers by two large 
residuals. It is immediately clear that this capacity is dependent on the 
positions (on the indices) of the outlying observations, since h_{ij} depends
on i and on j.

The picture may become clearer if we pose the question differently. Given
a residual vector r, what maximal percentage of its L^2-norm r^T r can be
explained by 1, or 2, or 3, etc. components? Let I = {i_1, ..., i_m} ⊂
{1, ..., n} denote a set of m indices. It turns out that the answer to the
above query is equal to 1 - λ_min(I), where λ_min(I) denotes the smallest
eigenvalue of the minor H_I of the hat matrix determined by the intersection
of the rows and columns from I. This is easy to show and we leave it to
the reader to check the statement.
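
One possible way to check the statement is sketched below. The notation is introduced for this illustration only: M = I - H, E is the m x n matrix that selects the components in the chosen index set, and H_I = E H E^T is the corresponding minor. The argument uses that M is a symmetric projection and that the nonzero eigenvalues of AB and BA coincide.

```latex
\[
\frac{\|E r\|^2}{\|r\|^2}
= \frac{r^T M E^T E M r}{r^T r}
\;\le\; \lambda_{\max}\bigl(M E^T E M\bigr)
= \lambda_{\max}\bigl(E M E^T\bigr)
= \lambda_{\max}\bigl(I_m - H_I\bigr)
= 1 - \lambda_{\min}(H_I)
\]
for every residual vector $r = Mr \neq 0$. The bound is attained by
$r = M E^T u$, where $u$ is an eigenvector of $E M E^T$ belonging to its largest
eigenvalue (assumed positive).
```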


Example 2 When the number m of residuals we wish to check is equal to
1, the minor H_I is equal to the scalar h_{ii} = p/n and 1 - λ_min(I) = 1 - p/n,
which is a result we already knew. If we pass to m = 2 for the 2^2 design,
all minors of dimension 2 have a minimal eigenvalue of 1/2. The largest
percentage of the total norm explained by two components, i.e., by half the
components, is equal to 50%. The situation in the 3 x 3 case is different. The
smallest eigenvalue of 2-dimensional minors is either 1/3 or 4/9, depending
on the position of the pair within the 3 by 3 table. The maximal percentage
of the total norm that can be explained by 2 of the 9 residuals is, therefore,
66.7%. Since with a single component, one can at most explain a percentage
of 44.4%, this is a bit disappointing. Evidently, two outliers will result in
two residuals that stick out much less than was the case with a single outlier.
This kind of behavior is typical for least-squares fits.
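
The eigenvalues quoted in Example 2 can be checked numerically. The sketch below builds the hat matrices of the 2^2 and the 3 x 3 main effects models from orthogonal contrast columns and collects the smallest eigenvalue of every 2 x 2 principal minor. The linear and quadratic contrasts used for the 3 x 3 table are one convenient orthogonal coding chosen for this illustration; any other basis of the same column space gives the same H.

```python
import math
from fractions import Fraction
from itertools import combinations


def hat_matrix(columns):
    """H = sum_j x_j x_j^T / (x_j^T x_j); valid only for mutually orthogonal columns."""
    n = len(columns[0])
    H = [[Fraction(0)] * n for _ in range(n)]
    for col in columns:
        norm2 = sum(c * c for c in col)
        for a in range(n):
            for b in range(n):
                H[a][b] += Fraction(col[a] * col[b], norm2)
    return H


def min_eigenvalues_of_pairs(H):
    """Smallest eigenvalue of each 2 x 2 principal minor of H, rounded to 6 decimals."""
    found = set()
    for i, j in combinations(range(len(H)), 2):
        a, b, d = float(H[i][i]), float(H[i][j]), float(H[j][j])
        lam_min = 0.5 * (a + d) - math.sqrt((0.5 * (a - d)) ** 2 + b * b)
        found.add(round(lam_min, 6))
    return found


# 2^2 main effects model: intercept, A, B coded -1 / +1.
cells_22 = [(a, b) for a in (-1, 1) for b in (-1, 1)]
H_22 = hat_matrix([[1] * 4, [a for a, _ in cells_22], [b for _, b in cells_22]])

# 3 x 3 main effects model: intercept plus linear and quadratic contrasts per factor.
lin, quad = (-1, 0, 1), (1, -2, 1)
cells_33 = [(i, j) for i in range(3) for j in range(3)]
H_33 = hat_matrix([
    [1] * 9,
    [lin[i] for i, _ in cells_33], [quad[i] for i, _ in cells_33],
    [lin[j] for _, j in cells_33], [quad[j] for _, j in cells_33],
])

eig_22 = min_eigenvalues_of_pairs(H_22)
eig_33 = min_eigenvalues_of_pairs(H_33)

if __name__ == "__main__":
    print("2^2:   smallest eigenvalues of pairs:", eig_22)
    print("3 x 3: smallest eigenvalues of pairs:", eig_33)
    print("3 x 3: maximal share for two residuals:", 1 - min(eig_33))
```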


What is the answer to the same question in the case of least-absolute-
deviations residuals e? For values of m smaller than the exact-fit-point
of the L^1 method, the residuals can be completely concentrated on any
set of m components. The exact-fit-point n_ef of an equivariant fitter can
be defined as the largest number of nonzero observations that one can
add in any position to the vector of observations y = 0 without changing
the fit from ŷ = 0. In the best of situations, this point is equal to or close
to n/4 - 1 for the least-absolute-deviations regression method (for details,
see Ellis and Morgenthaler, 1992).

A thorough discussion of the breakdown and outlier resistance problem
in designed experiments is given in Müller (1995). If we wish to be able to fit
all contrasts in a given model, the crucial quantity is the maximal number
of experimental runs that by themselves are not enough to determine a
fit of the model. In the 3 x 3 main effects design this number is equal
to six, which is bigger than the five dimensions of the parameter space.
Any equivariant fitter breaks down as soon as a majority of the 3 = 9 - 6
remaining observations are faulty.

For m > n_ef, depending on the position of the outliers, different out-
comes are possible.


Example 3 In a 3 x 3 design, n_ef = 1. For m = 2 and m = 3, the
following tables show some of the possibilities.


ojoo, [Of oto} ola] el 


In the left-most case, the residual e has in general two nonzero components.
They can be in the same position as the outliers a and b (this happens
when they are of opposite sign) or they can be spread to other positions,
one at the third position of the first line, the other indicating the more
important outlier among a and b. In the middle case, the nonzero residuals
are confined to the two positions where a and b are observed as long as they
have the same sign. Otherwise, the L^1 fit is not unique and a second and
third nonzero residual can pop up in the first and the last line. If a is
-300 and b is 290, the residual table can be as disparate as the following two
examples:
Since the vector


lies in the column space of the design matrix X, one can smoothly transform
between these two residual vectors without changing the L^1 norm. In the
right-most case, the situation is even more complex. If the outliers a, b and
c are of equal sign, the residual matrix faithfully reflects this structure. If
they are of unequal signs, surprising things can happen. The observed table


In this case, there exists an additive fit explaining the data with (merely)
two sizable residuals. When m = 3, we are beyond the range where we
can generally expect to distinguish outliers from additive structure in 3 x 3
tables. Most robust procedures prefer the fit found by the L^1 method over
the fit ŷ = 0. But one can, of course, imagine procedures that are able
to identify any additive structure as long as it is exactly adhered to by a
majority (LMS, Rousseeuw and Leroy, 1987). Such a procedure could not
distinguish between the two fits, which both have at least 5 of the 9 residuals
equal to zero.


3 Maximal residuals 


The size of the largest residual will often be taken as an indication of whether
faulty runs occurred during a designed experiment. As we saw in the last
section, when only a small number of the runs (fewer than the exact-fit-point)
are faulty and stick out very clearly, then the L^1 fit will produce residuals
that can safely be used to identify the faulty runs. Beyond this number
of grossly wrong observations and when the outlyingness is less clear cut,
the L^1 method is not successful. We also noted in the last section that
the L^1 fit has two drawbacks. Firstly, it does in general not produce a
unique answer. This may at first sight seem not to be a concern, but in
the case of designed experiments, multiplicity of possible answers is very
common and the set of L^1 solutions can be very varied. Secondly, the
solutions found with the known algorithms produce at least p exact zeros
among the residuals. Such least-absolute-deviations fits are typically not
very appealing when one analyzes them in detail. The greed for exact zeros
tends to make the nonzero residuals "too" big.

There are several simple arguments which allow us to estimate the aver-
age inflation factor that we should expect when passing from least-squares
residuals to least-absolute-deviations residuals. First, the L^1 criterion will
be about the same between the two solutions, i.e. Σ_{i=1}^n |e_i| ≤ Σ_{i=1}^n |r_i|
with rough equality for large values of n/p. If this were so and if we further
imagine that the exact zeros are created by randomly selecting the cells,
then the fact that the first sum contains p exact zeros makes |e_i| on average
n/(n - p) times larger than |r_i|. This is equivalent to imagining that the
nonzero L^1 residuals are constructed from an L^2 residual to which p/(n - p)
parts of p/n L^2 residuals are added. Both of these arguments overestimate
the inflation factor. A final check can also be made on the level of the vari-
ance. Suppose that a random selection of (n - p)/n of the L^2 residuals
were multiplied by the above inflation factor, i.e., n/(n - p), whereas the
others were put equal to zero. Under such a process, the variance of the
nonzero L^1 residuals would be equal to (n/(n - p))^2 times the variance
of the L^2 residuals. If we take the rough equality of the L^2 criterion as a
guide, we would expect the variation in e to be the same as the variation in
r. Since the components in e contain a mixture of p/n exact zeros with zero
variance and (n - p)/n nonzeros, the variation of the nonzero residuals is
expected to be n/(n - p) times bigger than the variation in r. This leads
to an inflation factor of sqrt(n/(n - p)), but this time one underestimates the
true size. Both factors tend to 1 as p/n tends to zero, and both are true
some of the time. Typically, when we have only a few degrees of freedom for
the error, the factor n/(n - p) is correct. This is the case, for example,
when we fit a 2^k design up to the (k - 1)-factor interactions. But it is also
roughly true for a 2 x k factorial design, where the number of degrees of
freedom for the error is k - 1 and thus arbitrarily large. The last example
shows that it is also the design itself that has an influence on the behavior
of the L^1 method.

In the $3 \times 3$ design, the inflation factor for the size of residuals is between
$\sqrt{9/4} = 1.5$ and $9/4 = 2.25$. Figure 1 illustrates what really happens for
four simple designs. Note that a naive interpretation of the $L^1$ residuals
$e$ would quite often lead to the conclusion that outlying experimental runs
were present, because of the large maximal residual size. The simple-minded
adjustment given above works reasonably well.
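
The two bounds discussed above are easy to tabulate. The following is a minimal sketch, assuming only the formulas $\sqrt{n/(n-p)}$ and $n/(n-p)$ from the text; the function name `inflation_bounds` and the dictionary of designs are illustrative and not part of the original paper.

```python
import math

def inflation_bounds(n, error_df):
    """Return (sqrt(n/(n-p)), n/(n-p)), where error_df = n - p."""
    ratio = n / error_df
    return math.sqrt(ratio), ratio

# (number of runs n, error degrees of freedom n - p) for the four designs of Figure 1
designs = {
    "2 x 3": (6, 2),
    "3 x 3": (9, 4),
    "2^3 main effects": (8, 4),
    "4 x 4": (16, 9),
}

for name, (n, df) in designs.items():
    low, high = inflation_bounds(n, df)
    print(f"{name}: between {low:.2f} and {high:.2f}")
```

Rounded to two decimals, these are the pairs of expected inflation factors quoted in the caption of Figure 1.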

Figure 2 repeats the experiment explained in Figure 1, but this time
with errors from a contaminated Gaussian with 1/n contamination having
a five-fold standard deviation.
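
A minimal sketch of such an error generator is given below. It assumes, following the caption of Figure 2, that exactly one randomly chosen run is contaminated; the function name and its parameters are hypothetical, and the paper's own simulation code is not shown in the text.

```python
import random

def contaminated_errors(n, sigma=1.0, factor=5.0, rng=None):
    """Draw n Gaussian errors; one randomly chosen run gets `factor` times the standard deviation.

    Assumption: exactly one run is contaminated (as in the caption of Figure 2),
    rather than each run being contaminated independently with probability 1/n.
    """
    rng = rng or random.Random()
    bad_run = rng.randrange(n)
    errors = [rng.gauss(0.0, sigma) for _ in range(n)]
    errors[bad_run] *= factor  # standard deviation 5*sigma, i.e. variance 25*sigma^2
    return errors, bad_run

errors, bad_run = contaminated_errors(9, rng=random.Random(1))
print("contaminated run:", bad_run)
```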

Among the 300 simulated experiments, roughly 60% resulted in the cor- 
rect identification of the faulty run in the sense that the largest sized resid- 
ual was indeed associated with the contaminated run. This is true for both 
methods of fitting. However, in the least-squares case, the largest residual 
does not stick out clearly when compared to the next largest one. 


[Figure 1 (plot not reproduced): four panels of boxplots, titled 2 by 3 (2 df), 3 by 3 (4 df), $2^3$ (4 df) and 4 by 4 (9 df); each panel shows Least Sq, Least Abs Dev and Ratio.]


Figure 1: The figure shows the behavior of the maximal residual size in
four different designs. The expected inflation factors are (1.73, 3.00) for
the $2 \times 3$, (1.50, 2.25) for the $3 \times 3$, (1.41, 2.00) for the $2^3$ main effects
and (1.33, 1.78) for the $4 \times 4$. The average inflation factors of the maximal
residual observed in 300 replications are 2.54, 1.98, 1.61 and 1.51. The
ratios between the maximal residual sizes computed for each replication are
also shown in the plots. These ratios are surprisingly stable.


Compared to the least-squares residuals, the $L^1$ residuals do fairly well.
But, when challenged by a standard robust estimator, they do worse. If
one uses the simulations shown in Figure 2, but replaces the least-squares
residuals by robust residuals based on Tukey’s biweight with 6 × MAD,
both produce about an equally large maximal size. However, the ratio
of largest to second largest is usually larger for the robust fit,
which, therefore, usually results in a clearer picture. This is due to the
higher degree of smoothness of such estimators when compared to the $L^1$
fitter. This also results in an improved relative efficiency.

Figure 2: The figure shows the behavior of the maximal residual size in
four different designs for an error distribution which is Gaussian in all runs
except one. In the exceptional one the variance of the Gaussian error is 25
times larger. The average inflation factors observed in 300 replications are
9.61, 1.97, 1.75 and 1.58 and thus remarkably close to the ones observed for
Gaussian data. The similarity with the Gaussian case is a bit surprising,
since the $L^1$ method is supposed to be able to identify more clearly the faulty
run. In order to check this, each figure contains a boxplot of the ratio of the
largest sized residual to the next largest one; the corresponding boxplots
are labelled “Out L1” and “Out L2”.


4 Non-uniqueness of the $L^1$ fit


The ambiguity of the $L^1$ fit is another problem that the user of this method
should be aware of. In the case of the $2 \times 2$ main effects model, for example,
the common $L^1$ algorithms will result in a residual vector $e$ consisting of three
exact zeroes and one non-zero residual whose size is exactly four times as
large as the size of the $L^2$ residuals. The run in which the single non-zero
residual is placed is completely arbitrary. All intermediate fits preserve
the $L^1$ norm. The same problem occurs in sufficiently balanced replicated
$2 \times 2$ designs such as the one presented in Sheather and McKean (1992,
Example 2, p. 153). In their example, the $L^1$ solution is not unique. In
fact, there is a 1-dimensional family of fits, which contains in the center a
solution close to the $L^2$ fit.
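
The $2 \times 2$ case can be checked by brute force. The sketch below is an illustration under stated assumptions, not the algorithm used in the paper: the responses `y` are made up, the effects are coded with +1/-1 columns, and the extremal candidates are obtained by fitting every choice of three runs exactly, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations

# 2 x 2 main effects model: intercept, row effect, column effect (+1/-1 coding).
X = [(1, 1, 1), (1, 1, -1), (1, -1, 1), (1, -1, -1)]
y = [5, 3, 2, 4]  # hypothetical responses, one per cell

def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def residuals(beta):
    return [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(X, y)]

# Least squares: the columns of X are orthogonal with squared length 4.
beta_ls = [Fraction(sum(row[j] * yi for row, yi in zip(X, y)), 4) for j in range(3)]
r = residuals(beta_ls)

# Fitting any three runs exactly gives one candidate with three zero residuals.
extremal = [residuals(solve([X[i] for i in runs], [y[i] for i in runs]))
            for runs in combinations(range(4), 3)]

for e in extremal:
    print([str(v) for v in e], "L1 norm:", sum(abs(v) for v in e))
print("least squares:", [str(v) for v in r], "L1 norm:", sum(abs(v) for v in r))
```

For these data each candidate has a single non-zero residual of absolute size 4, four times the size of the least-squares residuals, and all five fits share the same $L^1$ norm, which is the behavior described above.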

In the $2 \times 3$ design, the $L^1$ fit results in general in 2 non-zero residuals,
which are placed in two different columns. The placement of these two
residuals is arbitrary to the extent that we can make them switch rows. All
intermediate solutions preserve the $L^1$ norm. In the $3 \times 3$ design, things
become more complicated. In general, there are four non-zero residuals. The
values and placements of the non-zero cells are usually not unambiguously
determined. If the pattern of the non-zero cells (indicated by x) is of the
form


there is in general a 2-dimensional set of $L^1$ solutions.

If we want to recommend the use of $L^1$ fitting, it seems important to
me to produce an algorithm which enumerates all extremal points of the
polygon of $L^1$ fits instead of simply picking one, somewhat at random.
This will allow the user to judge to what extent the criterion is really identifying
outlying points or whether it simply produces large residuals by artificially
zeroing others.


5 Estimating error variation 


The undesirable features of the $L^1$ residuals that we have discussed above
will, of course, have an effect on their ability to predict the error variability.
The $p$ exact zeroes among the residuals are a property of the design, the
fitter and the algorithm. They contain no information about the error
distribution. It is, therefore, quite natural to compute the variance of the
non-zero $L^1$ residuals as an indication of the variance $\sigma^2$ of the errors.
Consequently, consider an $L^1$ solution with at least $p$ zero residuals and
at most $n - p$ non-zero residuals. It is evident that the sum of squares of
the non-zero residuals over-estimates the error variation and the question
is by how much. To answer this question, suppose we tried to reconstruct
the $L^2$ residuals $r$ on the basis of the $L^1$ residuals $e$. If $p = n - 1$, then $e$
would contain a single non-zero residual, which we would evenly distribute
over the $n$ values $|r_i|$. These reconstructed $L^2$ residuals would then have a sum
of squares and, since there is a single degree of freedom, a mean square
equal to $\sum_{i=1}^n e_i^2/n$. Now, suppose we have several non-zero $L^1$ residuals,
which are all of equal size $|e_i| = R$. Evenly distributing them over all $n$
observations leads to a size of $|r_i| = R(n-p)/n$. The mean square of these
reconstructed $r_i$ is equal to $\frac{n-p}{n^2}\sum_{i=1}^n R^2 = \sum_{i=1}^n e_i^2/n$, i.e., the
same formula as before. One proposal for estimating the error standard
deviation from an $L^1$ fit would, therefore, be the following:


Because of the non-uniqueness, $\sum_{i=1}^n |e_i|$, which is uniquely determined,
is in some sense a more appropriate basis for estimating $\sigma$, the standard
deviation of the error. In this case, the over-size of the $L^1$ residuals does not
play any role either. If we re-size them and in some way reconstruct
pseudo-$L^2$ residuals, their sum of absolute values would remain the same. How
would one estimate $\sigma$ based on a set of $L^2$ residuals $r$? Since the marginal
distribution of each $r_i$ has expectation zero and variance
$\sigma^2(1 - h_{ii}) = \sigma^2 (n-p)/n$, we have, for Gaussian errors,
$E(|r_i|) = \sqrt{2/\pi}\,\sqrt{(n-p)/n}\,\sigma$.


The statistic

$$\sqrt{\frac{\pi}{2}}\,\frac{1}{\sqrt{n(n-p)}}\sum_{i=1}^n |r_i|$$

is, therefore, an unbiased estimate of $\sigma$. In replacing $\sum_{i=1}^n |r_i|$ by $\sum_{i=1}^n |e_i|$,
we obtain an estimate that underestimates $\sigma$, but should still be a useful
indicator:

$$\sqrt{\frac{\pi}{2}}\,\frac{1}{\sqrt{n(n-p)}}\sum_{i=1}^n |e_i|.$$


Figure 3 shows with various plots how the two estimates behave for
Gaussian errors.
A similar behavior is found for other error distributions with a finite
second moment.
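
Both estimates are simple to compute from a vector of $L^1$ residuals. The sketch below is hedged: the displayed formula of the first proposal is not legible in the text, so it is assumed here to be the square root of the mean square $\sum e_i^2/n$ derived above; the second follows the absolute-value statistic of this section. The function names are illustrative.

```python
import math

def sigma_from_squares(e):
    """Assumed first proposal: square root of sum(e_i^2) / n."""
    n = len(e)
    return math.sqrt(sum(x * x for x in e) / n)

def sigma_from_abs(e, p):
    """Second proposal: sqrt(pi/2) * sum|e_i| / sqrt(n (n - p))."""
    n = len(e)
    return math.sqrt(math.pi / 2) * sum(abs(x) for x in e) / math.sqrt(n * (n - p))

# Hypothetical L1 residuals of a 2 x 2 main effects fit (n = 4, p = 3).
e = [0.0, 0.0, 0.0, 4.0]
print(sigma_from_squares(e), sigma_from_abs(e, p=3))
```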


Figure 3: The boxplots show the behavior of the two estimates $u$ and $v$ for
various balanced designs and Gaussian errors. The labels indicate for each
boxplot the number of observations $n$ and the dimension of the parameter
space $p$. Each design is represented with two boxplots, the first one for
$u$, the second one for $v$. The true value of $\sigma$ is equal to 1. The average
values over 300 replications are: 0.811, 0.778, 0.843, 0.819, 0.877, 0.968,
0.885 for $u$ and 1.02, 0.858, 0.888, 0.803, 0.897, 0.984, 0.887 for $v$. Both
estimates tend to underestimate $\sigma$.


6 Conclusions 


The $L^1$ method has several drawbacks and in particular leads to some
undesirable features built into the residuals. In my opinion it is not a
suitable method for fitting ANOVA data, for the following reasons:


(1) The computation of the $L^1$ fit is not as easy as the computation of the
$L^2$ fit. It is comparable in difficulty to robust fits.


(2) The $L^1$ residuals have some idiosyncrasies that should be known to
the user of this method. Ignorance will lead to wrong interpretations.
They cannot replace $L^2$ residuals in a straightforward manner.


(3) The non-uniqueness of the $L^1$ fit in ANOVA problems is the rule rather
than the exception. We lack easily available algorithms which exhibit
the whole solution set.


(4) The resistance of the $L^1$ fit to outliers is not good enough. One can do
better with competing methods that are computationally about equivalent.


Abstract: In a linear model $Y_{nN} = x_{nN}^T\beta + Z_{nN}$, $n = 1,\dots,N$, the problem
of testing the hypothesis $H_0 : L\beta = l$ versus $H_1 : L\beta \ne l$ is considered.
Wald-type tests based on asymptotically linear estimators are
used as tests. For such tests the asymptotic efficiency at the ideal model and
the asymptotic bias caused by outliers or other deviations from the ideal
model depend only on the influence function of the underlying estimator.
As for estimation, most efficient robust tests can be found by maximizing
the efficiency under the side condition that the bias is bounded by some
bias bound $b$. But this has the disadvantage that the solutions depend
on the bias bound $b$. To determine $b$ one can regard measures which are
composed of the efficiency and the bias. For estimation such a measure is
the mean squared error, while for testing the power relative to the bias
is used. It is shown that the $L_1$-tests, i.e. Wald-type tests based on the
$L_1$-estimator, maximize this relative power. This result is in contrast
to that for estimation, where the $L_1$-estimators do not minimize the mean
squared error.


Key words: Linear model, $L_1$-test, bias of the level, relative power.


AMS subject classification: Primary 62F35; secondary 62J05, 62J10, 62K05.


1 Introduction 
A general linear model

$$Y_N = X_N\beta + Z_N$$

is considered, where $Y_N = (Y_{1N},\dots,Y_{NN})^T$ is the vector of observations,
$\beta \in \mathbb{R}^r$ an unknown parameter vector, $X_N = (x_{1N},\dots,x_{NN})^T \in \mathbb{R}^{N\times r}$
the known design matrix with regressors $x_{1N},\dots,x_{NN} \in \mathbb{R}^r$ and
$Z_N = (Z_{1N},\dots,Z_{NN})^T$ the vector of errors. A realization of the random vector
$Y_N$ is denoted by $y_N = (y_{1N},\dots,y_{NN})^T$. In this linear model a hypothesis
of the general form

$$H_0 : L\beta = l$$

shall be tested versus the alternative

$$H_1 : L\beta \ne l,$$

where $l \in \mathbb{R}^s$ and $\varphi(\beta) = L\beta$ is a linear aspect of the unknown parameter
vector $\beta$ with given matrix $L \in \mathbb{R}^{s\times r}$ of rank $s$.

A large class of tests for the hypothesis $H_0 : L\beta = l$ is the class of
Wald-type tests based on asymptotically linear estimators, briefly called
ALE-tests (see Müller, 1992a,b, 1995a,b; Rieder, 1994, p. 153). To define these
tests we assume that the ideal distribution of the standardized errors $Z_{nN}/\sigma$
is $P$ and that the design $x_{1N},\dots,x_{NN}$ is converging to an asymptotic design
measure $\delta$ in the following sense:

$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^N e_{x_{nN}}(\{x\}) = \delta(\{x\})$$

for all $x \in \mathrm{supp}(\delta)$, where $\mathrm{supp}(\delta)$ is the support of $\delta$ and $e_x$ is the Dirac
measure on $x \in \mathbb{R}^r$. Then the ALE-tests have a test statistic of the form

$$T_N(y_N,X_N) = N\,(\hat\varphi_N(y_N,X_N) - l)^T\,\hat C_N(y_N,X_N)^{-1}\,(\hat\varphi_N(y_N,X_N) - l),$$

where $\hat\varphi_N$ is an asymptotically linear estimator for $\varphi(\beta) = L\beta$ with
influence function $\psi$ and $\hat C_N(y_N,X_N)$ is a consistent estimator for the asymptotic
covariance matrix of $\hat\varphi_N$, i.e. of $\sigma^2 C(\psi,\delta)$ with

$$C(\psi,\delta) = \int \psi(z,x)\,\psi(z,x)^T\,P(dz)\,\delta(dx).$$
Thereby an estimator $\hat\varphi_N$ for $\varphi(\beta) = L\beta$ is called asymptotically linear
with influence function $\psi : \mathbb{R}\times\mathbb{R}^r \to \mathbb{R}^s$ if $\int |\psi(z,x)|^2\,P(dz)\,\delta(dx) < \infty$,
$\int \psi(z,x)\,P(dz) = 0$ for all $x \in \mathrm{supp}(\delta)$, $\int \psi(z,x)\,x^T z\,P(dz)\,\delta(dx) = L$ and

$$\lim_{N\to\infty} P_{\beta_N}\left(\sqrt{N}\,\left|\hat\varphi_N - \varphi(\beta_N) - \frac{1}{N}\sum_{n=1}^N \psi\left(\frac{Y_{nN} - x_{nN}^T\beta_N}{\sigma},\,x_{nN}\right)\sigma\right| > \varepsilon\right) = 0$$

for all $\varepsilon > 0$, $\sigma \in \mathbb{R}^+$ and $\beta_N = \beta + N^{-1/2}\bar\beta$ with $\beta, \bar\beta \in \mathbb{R}^r$ and $L\beta = l$. The
set of all influence functions is denoted by $\Psi$. Many well-known estimators
such as M-estimators and R-estimators are asymptotically linear.

In particular the classical F-test is an ALE-test, where $\hat\varphi_N$ is the
Gauss-Markov estimator and $\psi(z,x) = L\,I(\delta)^- x\,z$. Thereby $M^-$ denotes the
generalized inverse of the matrix $M$, i.e. $M M^- M = M$, and $I(\delta)$ the
information matrix, i.e.

$$I(\delta) = \int x\,x^T\,\delta(dx).$$

If $\psi(z,x) = \psi^1(z,x) := L\,I(\delta)^- x\,\mathrm{sgn}(z)\sqrt{\pi/2}$, then it is the influence function
of the $L_1$-estimator and the corresponding ALE-test is called $L_1$-test.

If the standardized errors $Z_{1N}/\sigma,\dots,Z_{NN}/\sigma$ are independent and
identically distributed according to the ideal distribution $P$, then under $H_0$ the
ALE-test statistic $T_N$ has asymptotically a central chi-squared distribution
with $s$ degrees of freedom. Hence, the critical value of an asymptotic level $\alpha$
ALE-test can be determined as the $(1-\alpha)$ quantile of the chi-squared
distribution. Under contiguous alternatives of the form $\beta_N = \beta + N^{-1/2}\bar\beta \in \mathbb{R}^r$
with $L\beta_N = l + N^{-1/2}\gamma$ the ALE-test statistic has asymptotically a
chi-squared distribution with noncentrality parameter $\gamma^T[\sigma^2 C(\psi,\delta)]^{-1}\gamma$ so
that the power of the test is an increasing function of $\gamma^T C(\psi,\delta)^{-1}\gamma$.
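
As a concrete instance, the following sketch evaluates the $L_1$-test in the simplest special case, a location model with $r = s = 1$, $x = 1$ and $L = 1$, where the estimator is the sample median and $C(\psi^1,\delta) = \pi/2$ under the standard normal ideal distribution. It is an illustration under assumptions, not the paper's procedure: $\sigma$ is treated as known instead of being estimated, and the function name is hypothetical. For $s = 1$ the chi-squared quantile is the square of a normal quantile.

```python
import math
from statistics import NormalDist, median

def l1_test_location(y, l, sigma, alpha=0.05):
    """Wald-type L1-test of H0: beta = l in the location model (r = s = 1).

    Assumptions: sigma is known and the ideal error distribution is standard normal,
    so the asymptotic variance of the median is sigma^2 * pi / 2.
    """
    n = len(y)
    estimate = median(y)
    variance = sigma ** 2 * math.pi / 2
    statistic = n * (estimate - l) ** 2 / variance
    critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # (1 - alpha) quantile of chi-squared(1)
    return statistic, critical, statistic > critical

stat, crit, reject = l1_test_location([-1, 0, 1, 2, 3], l=0.0, sigma=1.0)
print(stat, crit, reject)
```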

If the errors $Z_{1N}/\sigma,\dots,Z_{NN}/\sigma$ have, instead of the ideal distribution $P$,
distributions which are contaminated by outliers and other deviations, then
under $H_0$ the asymptotic error probability can exceed the level $\alpha$. The
maximum bias of the level, which is possible under contaminated distributions,
is an increasing function of

$$\|\psi^T C(\psi,\delta)^{-1}\psi\|_\delta := \max_{(z,x)\in\mathbb{R}\times\mathrm{supp}(\delta)} \psi(z,x)^T\,C(\psi,\delta)^{-1}\,\psi(z,x). \qquad (1)$$


See Müller (1992a,b, 1995a,b), Rieder (1994), Heritier and Ronchetti (1994).
Let the ideal distribution $P$ be the standard normal distribution. In
Müller (1995a,b) the question was considered which ALE-test has maximum
power under the side condition that the maximum bias is bounded by some
bias bound. Expressed by influence functions this means: Which $\psi \in \Psi$
maximizes $\gamma^T C(\psi,\delta)^{-1}\gamma$ for all $\gamma \in \mathbb{R}^s$ under the side condition

$$\|\psi^T C(\psi,\delta)^{-1}\psi\|_\delta \le b, \qquad (2)$$

where $b$ is some given bias bound. But this means that the matrix $C(\psi,\delta)^{-1}$
should be maximized in the positive definite sense under the side condition
(2). In general this optimization problem has no solution (see Krasker and
Welsch, 1982). To find solutions one can regard, instead of the whole matrix
$C(\psi,\delta)^{-1}$, functionals of the matrix. An appropriate functional for testing
is the determinant of the matrix. In Müller (1995a,b) it was shown that for
maximizing $\det(C(\psi,\delta)^{-1})$ under the side condition (2) solutions can be
derived if the design $\delta$ is D-optimal, i.e. $\delta \in \arg\min\{\det(L\,I(\delta)^- L^T);\ \delta \in \Delta\}$.
But the solutions depend on the bias bound $b$, so that the question arises
how to choose $b$. Here in Section 2 we propose a criterion for choosing $b$
based on a relative power value. We show that the best ALE-test with
respect to this criterion is the $L_1$-test. In Section 3 we compare this result
with a corresponding result for estimation, and in Section 4 we give an
example.

2 Tests with maximum relative power 
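Müller (1995a,b) showed that the influence function $\psi_b$ given by

$$\psi_b(z,x) = \begin{cases} L\,I(\delta)^- x\,\mathrm{sgn}(z)\,\dfrac{\min\{|z|,\sqrt{b}\,y_b\}}{2\Phi(\sqrt{b}\,y_b)-1}, & \text{for } b > s,\\[2ex] L\,I(\delta)^- x\,\mathrm{sgn}(z)\,\sqrt{\pi/2}, & \text{for } b = s, \end{cases}$$

with $y_b > 0$ satisfying

$$y_b^2 = \frac{1}{s}\,g(\sqrt{b}\,y_b)$$

is a solution of the problem of maximizing the power criterion $\det(C(\psi,\delta)^{-1})$
under the bias side condition (2) if $\delta$ is a D-optimal design. Thereby $\Phi$
denotes the distribution function of the standard normal distribution and $g$
is given by

$$g(y) = \int \min\{|z|,y\}^2\,P(dz).$$

Note that $\psi_b$ with $b = s$ is the influence function $\psi^1$ of the $L_1$-test and that
$s = b_{\min} := \min\{\|\psi^T C(\psi,\delta)^{-1}\psi\|_\delta;\ \psi \in \Psi\}$. Recall that $s$ is the rank of
the matrix $L \in \mathbb{R}^{s\times r}$.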

For every solution $\psi_b$ we have that the quantity (1) providing the
maximum bias satisfies

$$\|\psi_b^T C(\psi_b,\delta)^{-1}\psi_b\|_\delta = b$$

and that the power criterion satisfies

$$\det(C(\psi_b,\delta)^{-1}) = \frac{u(b/s)^s}{\det(L\,I(\delta)^- L^T)}. \qquad (3)$$

Thereby $u : [1,\infty) \to (0,\infty)$ is defined by

$$u(a) = \begin{cases} \dfrac{(2\Phi(w(a))-1)^2}{g(w(a))}, & \text{for } a > 1,\\[2ex] \dfrac{2}{\pi}, & \text{for } a = 1, \end{cases}$$

where $w(a) > 0$ is implicitly given by $a\,g(w(a)) - w(a)^2 = 0$. In particular
we have $w(b/s) = \sqrt{b}\,y_b$. With the implicit function theorem it can be shown
that $u$ is an increasing function (see Müller, 1995a). This means that the
power of the ALE-test based on $\psi_b$ increases when the maximum bias value
$b$ increases, and vice versa.

For an appropriate choice of the bias bound $b$ we can set the power
value $\det(C(\psi_b,\delta)^{-1})$ in relation to the maximum bias given by (1). There
are in principle two possibilities: $b$ should be chosen so that the difference
between the power value and the bias value is maximized, or $b$ should
be chosen so that the ratio of the power value and the bias value is
maximal. Maximizing the difference

$$\det(C(\psi_b,\delta)^{-1}) - \|\psi_b^T C(\psi_b,\delta)^{-1}\psi_b\|_\delta = \frac{u(b/s)^s}{\det(L\,I(\delta)^- L^T)} - b \qquad (4)$$

has the disadvantage that the solution would depend on the formulation of
the hypotheses. Namely, if we use instead of the hypothesis $H_0 : L\beta = l$
the equivalent hypothesis $H_0 : \lambda L\beta = \lambda l$ with $\lambda \ne 1$, then we have to
use $\tilde\psi_b := \lambda\,\psi_b$ instead of $\psi_b$. While the bias value (1) is invariant with
respect to $\lambda$, this is not the case for the power value, so that a solution of
maximizing a difference like (4) would depend on $\lambda$. This problem does not
appear if we use the ratio of the power value and the bias value. Hence, $b$
should be chosen so that the relative power value

$$\frac{\det(C(\psi_b,\delta)^{-1})^{1/s}}{\|\psi_b^T C(\psi_b,\delta)^{-1}\psi_b\|_\delta} = \frac{1}{b\cdot\det(C(\psi_b,\delta))^{1/s}} \qquad (5)$$
is maximized. Thereby we take the $s$th root of the determinant of the
covariance matrix to ensure that an improvement of the covariance matrix
by a factor $c$ provides also an improvement of the relative power value by
the factor $c$. Note also that the $s$th root of the determinant is often used as
a measure for the entropy and that it is the geometric mean of the diagonal
elements if the covariance matrix is a diagonal matrix.
The following theorem shows that the $L_1$-test, i.e. the ALE-test based
on $\psi_b$ with $b = s$, has the maximum relative power.


Theorem 1 $b = s$ maximizes the relative power value (5) with respect to
$b$, i.e. the $L_1$-test has maximum relative power.


Proof: Using (3), maximizing (5) is equivalent to minimizing

$$b\cdot\det(C(\psi_b,\delta))^{1/s} = b\cdot\frac{1}{u(b/s)}\,\det(L\,I(\delta)^- L^T)^{1/s} = t\!\left(\frac{b}{s}\right) s\,\det(L\,I(\delta)^- L^T)^{1/s},$$

where

$$t(a) := \frac{a}{u(a)}, \quad a \ge 1.$$

By calculating the first and second derivative of $t : [1,\infty) \to (0,\infty)$ it can
be shown that $t$ is a convex function with $\lim_{a\downarrow 1} t'(a) = 0$, where $t'$ denotes
the first derivative of $t$. This means that $t(b/s)$ is minimized by $b = s$. To
calculate the first and the second derivative of $t$ it is helpful to calculate the
derivatives of $u$. At first note that the implicit function theorem provides

$$w'(a) = \frac{w(a)\,g(w(a))}{2a\,[2\Phi(w(a)) - 1 - 2w(a)\,\Phi'(w(a))]} > 0.$$


Then, with $h(y) := y\,\Phi(-y) - \Phi'(y)$, it can be shown that


u'(a) <0 and 
u'(a)a+2u (a) <0 


uta) AN 


is satisfied for all $a > 1$ (see Müller, 1995a, Lemma 12.7). The rule of
L'Hospital provides $\lim_{a\downarrow 1} w(a) = 0$ and $\lim_{a\downarrow 1} u(a) = 2/\pi$ so that

$$t'(a) = \frac{1}{u(a)}\left(1 + \frac{2\,w(a)\,h(w(a))}{2\Phi(w(a)) - 1}\right)$$

is converging to 0 for $a \downarrow 1$ (see Müller 1995a, Lemma 12.8). □
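
The functions $g$, $w$, $u$ and $t$ can also be explored numerically. The sketch below is only a numerical illustration, not part of the proof. It assumes the standard normal ideal distribution, for which $g(y) = 2\Phi(y) - 1 - 2y\,\Phi'(y) + 2y^2(1 - \Phi(y))$, and it solves $a\,g(w) = w^2$ by bisection; the starting bracket assumes that $a$ is not extremely close to 1 (say $a \ge 1.01$).

```python
import math

SQRT2 = math.sqrt(2.0)

def phi(y):
    """Standard normal density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def g(y):
    """g(y) = E min(|Z|, y)^2 for standard normal Z, in closed form."""
    return math.erf(y / SQRT2) - 2.0 * y * phi(y) + y * y * math.erfc(y / SQRT2)

def w(a, tol=1e-12):
    """Positive root of a*g(w) - w^2 = 0 for a > 1, found by bisection."""
    lo, hi = 1e-3, math.sqrt(a)  # g <= 1 implies that the root is at most sqrt(a)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if a * g(mid) - mid * mid > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def u(a):
    if a == 1:
        return 2.0 / math.pi
    wa = w(a)
    return math.erf(wa / SQRT2) ** 2 / g(wa)  # (2*Phi(w) - 1)^2 / g(w)

def t(a):
    return a / u(a)

for a in (1, 1.01, 1.1, 1.5, 2, 4, 10):
    print(f"a = {a}: u(a) = {u(a):.4f}, t(a) = {t(a):.4f}")
```

On such a grid $u$ increases from $2/\pi$ towards 1 and $t$ is smallest at $a = 1$, in line with Theorem 1.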


3 Comparison with estimation problems 


An asymptotically linear estimator $\hat\varphi_N$ for estimating $\varphi(\beta) = L\beta$ with
influence function $\psi$ is, under contaminated distributions, asymptotically
normally distributed. The asymptotic covariance matrix is $C(\psi,\delta)$ and the
maximum asymptotic bias is given by

$$\|\psi\|_\delta := \max_{(z,x)\in\mathbb{R}\times\mathrm{supp}(\delta)} |\psi(z,x)| \qquad (6)$$

(see Bickel, 1981, 1984; Rieder, 1985, 1987, 1994). Solutions $\psi_b^E$ which
minimize the trace of the covariance matrix $C(\psi,\delta)$ under the bias side
condition $\|\psi\|_\delta \le b$ can be characterized explicitly if the design $\delta$ is based
on linearly independent regressors or if the design $\delta$ is A-optimal, i.e.
$\delta \in \arg\min\{\mathrm{tr}(L\,I(\delta)^- L^T);\ \delta \in \Delta\}$ (see Kurotschka and Müller, 1992; Müller,
1994a). As for testing, the optimal influence function $\psi_b^E$ depends heavily
on the bias bound $b$. Moreover, only the solution $\psi_b^E$ with
$b = b_{\min}^E := \min\{\|\psi\|_\delta;\ \psi \in \Psi\}$ corresponds to the $L_1$-estimator, i.e. satisfies $\psi_b^E = \psi^1$.

Similarly to testing, an optimal bias bound can be found by combining
the efficiency criterion $\mathrm{tr}(C(\psi_b^E,\delta))$ with the bias value $\|\psi_b^E\|_\delta$. A natural
combined criterion is the asymptotic mean squared error

$$M_\delta(b) := \|\psi_b^E\|_\delta^2 + \mathrm{tr}(C(\psi_b^E,\delta)) = b^2 + \mathrm{tr}(C(\psi_b^E,\delta)). \qquad (7)$$

It can be shown that, as for testing, $M_\delta$ is convex. But in contrast to
testing, $M_\delta$ attains its minimum for a value $b > b_{\min}^E$, so that the $L_1$-estimator
does not minimize the mean squared error. See Müller (1994b,c).

Hence, we have the following situation: The approaches for estimation
and testing look very similar and lead to similar constrained optimization
problems of maximizing the efficiency under a bias bound $b$. Nevertheless,
the problem of finding an optimal bias bound by using a natural criterion
which combines efficiency and bias leads to qualitatively different results.
For testing the best bias bound is $b = b_{\min} = s$ so that the $L_1$-test is
optimal, while for estimation the best bias bound is $b > b_{\min}^E$ so that the
$L_1$-estimator is not optimal. Moreover, for estimation the optimal bias bound
depends strongly on the model, the design and the aspect $L\beta$, and it can be
calculated only numerically. For testing the optimal bias bound is simply
$s$, the rank of $L$.


4 Example 


Consider a one-way lay-out model with four levels, i.e. we have four samples
with unknown means $\beta_1, \beta_2, \beta_3, \beta_4$ so that the observations are given by

$$Y_{nN} = \beta_i + Z_{nN},$$

if the observation $Y_{nN}$ belongs to the sample $i$, $i = 1,2,3,4$. This model
can be expressed as a linear model with $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)^T \in \mathbb{R}^4$ and

$$x_{nN} = x(i) := (1_1(i), 1_2(i), 1_3(i), 1_4(i))^T.$$

Often one sample, say sample 1, is a control group. Then an interesting
aspect of $\beta$ is the linear aspect $\varphi(\beta) = (\beta_2 - \beta_1, \beta_3 - \beta_1, \beta_4 - \beta_1)^T \in \mathbb{R}^3$.
Then testing $H_0 : \varphi(\beta) = 0$ against $H_1 : \varphi(\beta) \ne 0$ is equivalent to
testing $H_0 : \beta_1 = \beta_2 = \beta_3 = \beta_4$. A D-optimal design for $\varphi(\beta)$ is
$\delta = \frac{1}{4}(e_{x(1)} + e_{x(2)} + e_{x(3)} + e_{x(4)})$, which means that the four samples are of
equal size. At this design the influence function of the $L_1$-test for $H_0$ and
the $L_1$-estimator for $\varphi(\beta)$ has the form

$$\psi^1(z,x(i)) = \begin{cases} (-1,-1,-1)^T\,\mathrm{sgn}(z)\,4\sqrt{\pi/2}, & \text{for } i = 1,\\ (1_2(i), 1_3(i), 1_4(i))^T\,\mathrm{sgn}(z)\,4\sqrt{\pi/2}, & \text{for } i \ne 1. \end{cases}$$

According to Theorem 1 this influence function provides the maximum
relative power within all ALE-tests for $H_0$. But this influence function does not
provide the minimum mean squared error within all asymptotically linear
estimators for $\varphi(\beta)$. The influence function providing the minimum mean
squared error has the form


(—1,—-1,-1)" sgn(z) mnie rel, for i= 1, 
(12(i), 13(4), 14(4))? sgn(z) ==iehent, fori#1, 


Ub 


pp(z, x(t) = 
where $b \approx 8.7213$, $w_b \approx 0.0186$ and $v_b \approx 0.2411$ (see Müller, 1994c).
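
The $L_1$ influence function of this example can be checked with a few lines of exact arithmetic. The sketch below assumes the equal-weight design of the example, for which $I(\delta)$ is one quarter of the identity matrix, and uses the identity $(L L^T)^{-1} = I - \frac{1}{4}\,1 1^T$ for the contrast matrix $L$; the helper names are illustrative. Since the factor $\pi/2$ cancels, the quadratic form below equals the bias value (1) of $\psi^1$ at each support point.

```python
from fractions import Fraction

# Contrasts of samples 2, 3, 4 with the control sample 1.
L = [[-1, 1, 0, 0],
     [-1, 0, 1, 0],
     [-1, 0, 0, 1]]

def direction(i):
    """L I(delta)^- x(i) for level i = 1..4; I(delta)^- = 4 * identity at equal weights."""
    return [4 * row[i - 1] for row in L]

def bias_value(v):
    """v^T (L I(delta)^- L^T)^{-1} v, using (4 (I + 1 1^T))^{-1} = (I - 1 1^T / 4) / 4."""
    total = sum(v)
    return Fraction(sum(x * x for x in v), 4) - Fraction(total * total, 16)

for i in range(1, 5):
    print(i, direction(i), bias_value(direction(i)))
```

Each level gives the value 3, the rank $s$ of $L$, which is the minimum bias bound $b_{\min} = s$ attained by the $L_1$-test.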
1 Introduction


In regression models the $L_1$-norm estimators receive their justification from
robustness theory. For the error distribution, only assumptions on the
behavior around the median, which should be zero, are required for
consistency results.

For nonlinear regression the consistency of the $L_1$-norm estimators is
shown by Oberhofer (1982). Richardson and Bhattacharyya (1987)
extended this result to general noncompact parameter sets by using a sieves
method. The general approach of M-estimators in Liese and Vajda (1994)
for nonlinear regression models also includes the $L_1$-norm estimator. They
obtained a result similar to that of Richardson and Bhattacharyya (1987) with a
different method and conditions that are statistically more transparent.

The concept of minimum contrast estimators and the method of sieves
are studied in the nonparametric theory by van de Geer (1990), van de Geer
(1995), Birgé and Massart (1991) and Birgé and Massart (1994). They
apply their general results for consistency rates of minimum contrast
estimators to the $L_1$-norm estimator of the nonparametric regression function.
(Compare also Section 3.4.4 in the recent book of van der Vaart and
Wellner, 1996.)

The aim of this paper is to give consistency results for the $L_1$-norm estimator
in semiparametric regression models, where the parameter of interest is of
fixed dimension and the nuisance parameter is either given as an unknown
function from a nonparametric function space or has a dimension which
increases with the sample size. In this paper we will study simultaneously
three models, namely nonlinear regression, nonlinear functional relation and
nonlinear semiparametric regression, all to be introduced in Section 2. This
allows us to demonstrate the line of proof of the $L_1$-norm consistency results
and to emphasize the underlying problems. For illustration, the
$L_1$-approach is also embedded in the minimum contrast context, see Section
3. The proof consists mainly of two steps given in Section 4: first the
approximation of the empirical $L_1$-norm by its expected value, which will
be done in Subsection 4.1, and second the identification of the parameter
by using the expected values of the difference of empirical $L_1$-norms in
Subsection 4.2. The identification problem is specific to the individual
models. This is not characteristic of the $L_1$-norm approach, since the same
problem also occurs in the $L_2$-theory. The technique of approximation is
more or less standard and is based on results on the increments of
sub-Gaussian processes. In Section 5 the consistency results for each model
are given separately. For the nonlinear regression model this is a known
strong consistency result in the form of an exponential probability inequality.
The consistency result for the nonlinear functional relation model is new.
Under this kind of entropy condition on the nuisance parameter space,
results are known only for the least squares estimator, see Zwanzig (1990). The
nonlinear semiparametric model is of a special structure, because of the
constraints imposed by the identification problem. Linton (1995) studied
this model with a linear parametric part. The result given here for this
model seems to be new as well.


2 The general setting 


Here we introduce a general semiparametric regression model to be specified
in the following. Suppose we have independent and in general not
identically distributed two-dimensional real-valued observations $(y_1,x_1),\dots,(y_n,x_n)$,
generated by

$$y_i = g(\xi_i,\beta) + \varepsilon_{1i}, \qquad (1)$$

$$x_i = h(\xi_i,\beta) + \varepsilon_{2i}, \qquad (2)$$

with $i = 1,\dots,n$. The probability distribution of each observation $(y_i,x_i)$ is
$P_{\xi_i,\beta}$. The common distribution of the whole sample is denoted by
$P_{\xi,\beta} = \prod_{i=1}^n P_{\xi_i,\beta}$ and is dominated by a $\sigma$-finite measure $\mu_n$.

The errors $\varepsilon_{ji}$ are i.i.d. with distribution $P_\varepsilon$, expected value zero and
positive variance $\sigma^2$. The error distribution does not depend on the
parameter $\beta$. To get the consistency of $L_1$-norm estimators we will need the
assumption E on the error distribution that the median is zero and that
the distribution has enough mass in the local neighborhood of it:


E  $\exists D_0\ \exists k_\varepsilon\ \forall d < D_0$ such that

$$P_\varepsilon(-d < \varepsilon < 0) > k_\varepsilon d, \qquad P_\varepsilon(0 < \varepsilon < d) > k_\varepsilon d. \qquad (3)$$
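
Condition (3) is easy to probe numerically for a given error law. The sketch below is only a grid check under assumptions, not a proof: it takes a distribution object with a `cdf` method (here `statistics.NormalDist`), and the function name, the grid size and the candidate constants are hypothetical choices.

```python
from statistics import NormalDist

def satisfies_E(dist, d_max, k, grid=1000):
    """Check P(-d < eps < 0) > k*d and P(0 < eps < d) > k*d on a grid of 0 < d < d_max."""
    for j in range(1, grid):
        d = d_max * j / grid
        left = dist.cdf(0.0) - dist.cdf(-d)
        right = dist.cdf(d) - dist.cdf(0.0)
        if left <= k * d or right <= k * d:
            return False
    return True

# Standard normal errors: mass near the median grows at least like 0.3 * d for d < 1.
print(satisfies_E(NormalDist(), d_max=1.0, k=0.3))
print(satisfies_E(NormalDist(), d_max=1.0, k=0.4))
```

For standard normal errors the check passes with $k_\varepsilon = 0.3$ and $D_0 = 1$, and fails for $k_\varepsilon = 0.4$, because the density at zero is only about 0.399.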


The functions $h(\cdot,\cdot)$ and $g(\cdot,\cdot)$ are continuous and known. The
regression parameter $\beta \in B \subset \mathbb{R}^p$ is the parameter of interest. The
dimension $p$ of $\beta$ does not depend on the sample size $n$. The design points or
variables $\{\xi_1,\dots,\xi_n\} \subset \mathbb{R}$ are unknown and fixed. They are the nuisance
parameters, whose number grows with the sample size $n$. We write the
nuisance parameters as components of a column vector of dimension $n$:
$\xi = \xi_{(n)} = (\xi_1,\dots,\xi_n)^T \in \mathcal{X}^{(n)} \subset \mathbb{R}^n$. The common unknown parameter is

$$\theta = (\xi,\beta) \in \Theta = \mathcal{X}^{(n)} \times B \subset \mathbb{R}^{n+p}. \qquad (4)$$

The model assumptions (1), (2) above include different models for
different specifications of $\mathcal{X}^{(n)} \subset \mathbb{R}^n$ and of the functions $h$ and $g$.

2.1 The nonlinear regression model 


Suppose the design is known. That is, we have $\mathcal{X}^{(n)} = \{(\xi_1,\dots,\xi_n)^T\}$ for known $\xi_1,\dots,\xi_n$.
We consider only the first equation (1) and obtain the nonlinear regression
model with $n$ observations

$$y_i = g(\xi_i,\beta) + \varepsilon_{1i}, \quad \text{with } i = 1,\dots,n. \qquad (5)$$

Note that in this model no nuisance parameters occur.


2.2 The nonlinear error-in-variables model 


For $h(\xi_i,\beta) = \xi_i$ in (2) we obtain the nonlinear error-in-variables model,

$$y_i = g(\xi_i,\beta) + \varepsilon_{1i}, \qquad (6)$$

$$x_i = \xi_i + \varepsilon_{2i}, \qquad (7)$$

with $i = 1,\dots,n$. This is exactly the functional one, because the variables are
considered as fixed and unknown and play the role of a nuisance parameter
with increasing dimension. For consistency we need additional assumptions
on the set of nuisance parameters $\mathcal{X}^{(n)}$, because we know that for
$\mathcal{X}^{(n)} = [0,1]^n$ the least squares estimator is inconsistent, see Kukush and Zwanzig
(1996). One interesting piece of additional information may be


$$0 \le \xi_1 \le \xi_2 \le \dots \le \xi_n \le 1. \qquad (8)$$


At first sight this assumption seems to be artificial; it is, however,
useful in applications, for instance in biology or chemistry. There the
unknown design points $\xi_i$ often stand for different levels of concentration.
The experimenter measures these concentration levels with error $\varepsilon_{2i}$. But
he does have some influence on the level of the concentration, and
he can guarantee with high certainty that the concentration level of the next
experiment will be higher.
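
To make the setting concrete, the sketch below simulates data from the error-in-variables model (6), (7) with ordered design points in $[0,1]$. Everything specific is an assumption made for illustration: the Gaussian error law, the regression function passed as `g`, the value of `beta` and the function name `simulate_eiv`.

```python
import math
import random

def simulate_eiv(n, beta, g, sigma=0.1, seed=0):
    """Simulate (xi, x, y) from y_i = g(xi_i, beta) + eps_1i and x_i = xi_i + eps_2i.

    Assumptions: the unknown design points are ordered values in [0, 1] and the
    errors are Gaussian with standard deviation sigma.
    """
    rng = random.Random(seed)
    xi = sorted(rng.random() for _ in range(n))            # increasing concentration levels
    y = [g(v, beta) + rng.gauss(0.0, sigma) for v in xi]   # equation (6)
    x = [v + rng.gauss(0.0, sigma) for v in xi]            # equation (7)
    return xi, x, y

# Hypothetical regression function g(xi, beta) = exp(-beta * xi).
xi, x, y = simulate_eiv(20, beta=1.5, g=lambda v, b: math.exp(-b * v))
print(xi[:3], x[:3], y[:3])
```

Only `x` and `y` would be observed in practice; `xi` plays the role of the nuisance parameter.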
The assumption (8) can be rewritten as

$$\mathcal{X}^{(n)} = \{\xi = (f(z_1),\dots,f(z_n))^T : f : [0,1] \to [0,1],\ f \text{ increasing}\}, \qquad (9)$$

where the $z_1 < \dots < z_n$ are fixed design points satisfying the following
design condition:


D  $\lim_{n\to\infty} \max_{2 \le i \le n} |z_i - z_{i-1}| = 0.$


Another possibility is to consider 


aae ae a a a aC (0) 


with 
raoe recae A n 
m, a ) = m,a |V; : fom) (x1) — fim) (x2)| < L [£1 _ ro\° 
11) 


This kind of additional information is also used in the following
semiparametric model.


2.3 The nonlinear semiparametric model


Assume that $\{z_1,\dots,z_n\} \subset [0,1]$ are known design points satisfying D.
Under the specification $\xi_i = f(z_i)$ and $g(\xi_i,\beta) = m(z_i,\beta) + \xi_i$, where the
functions $m$ and $f$ satisfy the following identification condition,


ID  $\exists n_0\ \forall n > n_0\ \exists \{w_i\} > 0$, $\sum_{i=1}^n w_i = 1$, $\forall f \in \Lambda_{m,\alpha}(C,L)\ \forall \beta \in B$ such
that

$$\sum_{i=1}^n w_i\,m(z_i,\beta)\,f(z_i) = 0, \qquad (12)$$

we consider only the first equation (1) and obtain the model

$$y_i = m(z_i,\beta) + f(z_i) + \varepsilon_{1i}. \qquad (13)$$

The condition ID contains the orthogonality in the sense of the empirical
measure generated by the weighted design points $z_1,\dots,z_n$. This model (13)
describes alternatives in the context of model choice.


3 Minimum contrast estimates 


The $L_1$-norm estimator will be considered as a special minimum contrast
estimator. We call a nonrandom positive real function $C_n : \theta \in \Theta^c \to \mathbb{R}_+$
a contrast for $\theta$ at $\theta_{(n)}$ iff it is lower semicontinuous and


where $\Theta^c$ denotes the compactification of the parameter set $\mathcal{X}^{(n)} \times B$ in
$\mathbb{R}^{n+p}$. The contrast may depend on the sample size $n$. Examples are the
empirical $L_1$-contrast


Chn (0) = a 
i=1 


g (€, 8) — g (€2,00) |" (15) 
or the asymptotic L,-contrast 

Cx (0) =C (8)= | |9(z,8) -g (2,6°)|" ac, (16) 
with $\theta_{(n)} = (\beta^0, \xi^0)$ satisfying (1) and (2). The first depends on the
unknown design points $\xi \in \mathcal{X}^{(n)}$, the other on an asymptotic design $G$.
Under a unique parameterization of the regression function each distance 


106 Silvelyn Zwanzig 


measure $d$ for functions $g(\cdot, \beta) \in \mathcal{M}$ seems to be a useful contrast at $\theta_{(n)} = \beta^0$, namely $C_n(\beta) = d(g(\cdot, \beta), g(\cdot, \beta^0))$.


We call a measurable function

$$\tilde{C}_n(\cdot, \cdot, \cdot) : \mathbb{R}^n \times \mathbb{R}^n \times \Theta^c \to \mathbb{R}_+ \qquad (17)$$

a contrast function and require that it is continuous with respect to $\theta$. In
the general statistical experiment, which includes random processes and
random fields as well, Liese and Vajda (1995) introduced a more general
concept and called the function corresponding to (17) a contrast principle.
To simplify the notation we suppress the dependence on the sample and
let the tilde indicate it: $\tilde{C}_n(X, Y, \theta) =: \tilde{C}_n(\theta)$. We then
define the corresponding estimator as follows.

A measurable solution $\hat{\theta} : \mathbb{R}^n \times \mathbb{R}^n \to \Theta^c$ is called a minimum contrast
estimator iff

$$\hat{\theta} \in \arg\min_{\theta \in \Theta^c} \tilde{C}_n(\theta). \qquad (18)$$

Under the model assumptions above the existence of minimum contrast
estimators is given by Lemma 2 of Liese and Vajda (1995).
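
As a purely illustrative sketch of definition (18), the following Python snippet minimises a weighted $L_1$ contrast over a finite grid of parameter values. The regression function `g`, the grid, the equal weights and the toy data are assumptions made for this example and are not taken from the paper; the grid search merely stands in for the arg min over the compactified parameter set.

```python
import math

def l1_contrast(beta, xs, ys, g, weights):
    """Weighted L1 contrast function: sum_i w_i * |y_i - g(x_i, beta)|."""
    return sum(w * abs(y - g(x, beta)) for w, x, y in zip(weights, xs, ys))

def min_contrast_on_grid(grid, xs, ys, g, weights):
    """Return the grid point with the smallest contrast value."""
    return min(grid, key=lambda b: l1_contrast(b, xs, ys, g, weights))

def g(x, beta):
    # Hypothetical nonlinear regression function, chosen only for illustration.
    return math.exp(beta * x)

xs = [i / 10 for i in range(11)]           # known design points in [0, 1]
ys = [g(x, 0.5) for x in xs]               # noise-free responses for beta0 = 0.5
ys[5] += 5.0                               # one gross outlier
weights = [1.0 / len(xs)] * len(xs)        # unweighted case, w_i = 1/n
grid = [k / 100 for k in range(201)]       # candidate values 0.00, 0.01, ..., 2.00

beta_hat = min_contrast_on_grid(grid, xs, ys, g, weights)
print(beta_hat)
```

In this toy setting the single outlier does not move the $L_1$ minimiser away from the value used to generate the data, which is the kind of robustness that motivates the $L_1$ contrast.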

The following lemma gives the connection between the consistency of
the minimum contrast estimator and the uniformly consistent approximation
of the contrast $C_n(\theta)$ by the contrast function $\tilde{C}_n(\theta)$. It is a version of an
"argmin" result, like the argmax theorem for i.i.d. experiments in van der
Vaart and Wellner (1996). Consider the differences of the contrast and of
the contrast function,

$$\Delta C_n(\theta) = C_n(\theta) - C_n(\theta_{(n)}) \quad\text{and}\quad \Delta\tilde{C}_n(\theta) = \tilde{C}_n(\theta) - \tilde{C}_n(\theta_{(n)}). \qquad (19)$$


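Lemma 1 Let $\rho : \mathbb{R}_+ \to \mathbb{R}_+$ be strictly increasing, with $\rho(0) = 0$. Let
$d(\cdot,\cdot)$ be a semimetric on $\Theta^c$. For any $\varepsilon > 0$ define the set

$$\Theta_n(\varepsilon) = \Theta^c \cap \{\theta : d(\theta, \theta_{(n)}) \ge \varepsilon\}. \qquad (20)$$

Let $C_n$ be a contrast with

$$\Delta C_n(\theta) = C_n(\theta) - C_n(\theta_{(n)}) \ge \rho\left(d(\theta, \theta_{(n)})\right). \qquad (21)$$

Then

$$\forall \varepsilon > 0 \quad P\left(d(\hat{\theta}, \theta_{(n)}) \ge \varepsilon\right) \le P\left(\sup_{\theta \in \Theta_n(\varepsilon)} \frac{|\Delta\tilde{C}_n(\theta) - \Delta C_n(\theta)|}{\rho\left(d(\theta, \theta_{(n)})\right)} \ge 1\right). \qquad (22)$$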
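Proof: From (21) it follows that

$$\min_{\theta \in \Theta_n(\varepsilon)} \frac{\Delta C_n(\theta)}{\rho\left(d(\theta, \theta_{(n)})\right)} \ge 1. \qquad (23)$$

Under $\hat{\theta} \in \Theta_n(\varepsilon)$ we have $\tilde{C}_n(\hat{\theta}) \le \tilde{C}_n(\theta_{(n)})$ and $\rho(d(\hat{\theta}, \theta_{(n)})) \ge \rho(\varepsilon) > 0$, thus

$$\frac{\Delta\tilde{C}_n(\hat{\theta})}{\rho\left(d(\hat{\theta}, \theta_{(n)})\right)} \le 0. \qquad (24)$$

Hence under $\hat{\theta} \in \Theta_n(\varepsilon)$ we obtain from (23) and (24) the following chain
of inequalities

$$1 \le \min_{\theta \in \Theta_n(\varepsilon)} \frac{\Delta C_n(\theta)}{\rho\left(d(\theta, \theta_{(n)})\right)} - \frac{\Delta\tilde{C}_n(\hat{\theta})}{\rho\left(d(\hat{\theta}, \theta_{(n)})\right)} \le \sup_{\theta \in \Theta_n(\varepsilon)} \frac{|\Delta C_n(\theta) - \Delta\tilde{C}_n(\theta)|}{\rho\left(d(\theta, \theta_{(n)})\right)}, \qquad (25)$$

which yields the statement of the lemma. □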


Note that the rate of convergence of the minimum contrast estimator 
given by this lemma depends mainly on the separation property of the 
contrast (21) and the semimetric d(.,.) chosen in (21). 


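4 The $L_1$-estimator


In the following we focus our attention on the $L_1$-contrast function only.
This means, let from now on

$$\Delta\tilde{C}_n(\theta) = \sum_{i=1}^n w_{1i}\left(|\varepsilon_{1i} + \Delta g(\xi_i, \beta)| - |\varepsilon_{1i}|\right) + \sum_{i=1}^n w_{2i}\left(|\varepsilon_{2i} + \Delta h(\xi_i, \beta)| - |\varepsilon_{2i}|\right) \qquad (26)$$

with

$$\Delta g(\xi_i, \beta) = g(\xi_i^0, \beta^0) - g(\xi_i, \beta) \quad\text{and}\quad \Delta h(\xi_i, \beta) = h(\xi_i^0, \beta^0) - h(\xi_i, \beta).$$

We will see that the $L_1$-contrast is

$$\Delta C_n(\theta) = E_{\xi^0 \beta^0}\, \Delta\tilde{C}_n(\theta). \qquad (27)$$

It is easily checked that the $L_1$-contrast does not coincide with the contrast
defined by the empirical $L_1$-metric (15) or by the asymptotic $L_1$-metric
(16), see Lemma 4.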

To apply Lemma 1 in the consistency proof of the $L_1$-estimator, we
need two steps: first, the uniform approximation of the contrast by
the contrast function with an appropriate rate; second, the study of the
separation condition (21).
4.1 The approximation


One of the main advantages of the $L_1$-approach is that the difference

$$Z(\theta) = \Delta C_n(\theta) - \Delta\tilde{C}_n(\theta) \qquad (28)$$

as a stochastic process $\{Z(\theta), \theta \in \Theta^c\}$ with index set $\Theta^c \subset \mathbb{R}^{n+p}$ forms a
sub-Gaussian process without additional assumptions on the tails of the
error distribution $P$. The only things we need are Lipschitz conditions
and a convenient semimetric $d$ in $\Theta^c$. Henceforth, we will use the following
notation for the sum of weighted squares with known normalized weights

$$w_{ji} > 0, \quad \sum_{i=1}^n w_{ji} = 1, \quad w_{\max} = \max_{i,j} w_{ji}, \qquad (29)$$

$$\sum_{i=1}^n w_{ji}\, |g(\xi_i, \beta)|^2 = \int |g(\xi, \beta)|^2 \, dG_{w_j}(\xi) = |g(\cdot, \beta)|_{w_j}^2,$$

where $G_{w_j}(\xi)$ is the weighted empirical measure generated by the design
points $\xi \in \mathcal{X}^{(n)}$. We also use the corresponding notation for the scalar
product $(\cdot,\cdot)_{w_j}$. Note that in the unweighted case $w_{\max} = n^{-1}$. The Lipschitz
conditions are used with respect to both types of parameters.


L1 Jno 3L, Li < œ Vn > no YB € B° VE, E e Xe such that 
9 (E, B) — g (EB) 2, + |h (€, B) — h (E, BE, < La JE- Ela (80) 
L2 Jno Sle, Lz < 00, Yn > no VE € XM VB, B' € B°, such that 
Ig (€, B) — g (6B), + [R (€, B) — h (£, BY, < L 12-6". 8 
In (31) ||.|| denotes the Euclidean norm in R?. 
Lemma 2 Suppose L1, L2. Set 
d (0,6')” = winax [IE - Elin + IIB - B'E) (32) 


Then there is a constant a = a (Lı, L2) independent of n and @ such that 
for allt > 0 and all n > no 


Q 2 
P; (|Z (0) — Z (0')| > t) < 2exp ra) . (33) 
Proof: We have for $Z(\theta)$ as defined in (28) with (26), (27) that $Z_n(\theta)$


j-1 X: (0) and HX; (0) = 0. Using the inequality ||a +] — |a + cll < 
|b + c| we obtain 
|X: (0) — Xi (6/)| < 2uns |g (i, 8) — g (£i: B')| + 2wai [h (Ei, B) — h (£i, 8')| 
and using the triangle inequality we get further 
|X: (0) — X: (6")| < di (0,9') (34) 


where d; (0,0) denotes the following seminorm d; (0,0) = dj; (0,6’) + 
dy; (0,0') , with 


dy; (0,0°) = 2wii |g (&, B) — g (Ei, B)| + 2uii lg (&, B) — g (E BA, 


do; (0, 0°) = 2wa; |h (ĉi, B) — h (£i, B)| + 2w: |h (E;, 8) — h (£i, B°). 


Because of (34), we can apply the Corollary 3.2 of van de Geer (1990). 
Thus there is a constant a’ independent of n and 0 such that 


2a! 
Pgo go (|Z (0) —Z (6°) | > t) < 2exp ae) ; 


with d (0, 0")? = Y% d; (0, 6’) . Under the conditions L1, L2 we estimate 


n 


So di (6, 9’)? < 16Wmax (Lı T Lə) (l£ == IA + |G B B'I?) 


i=1 
and obtain the statement with a = a/ (32 (Lı + L2))™*. O 


In order to formulate the entropy condition we need a few more definitions.
Let us introduce them for a general set $A$ with a metric $d$, because
inside the proof we will use the notion of entropy in the context of several
different sets. A family of subsets $U_1, \dots, U_N$ is called an $\varepsilon$-covering of $A$ with
respect to a metric $d$, if the diameter of each $U_k$ does not exceed $2\varepsilon$ and if
the sets cover $A$, $A \subset \bigcup_{k=1}^N U_k$. The $\varepsilon$-covering number $N(\varepsilon)$ is the minimal
number of $U_k$'s in any $\varepsilon$-covering of $A$. The $\varepsilon$-entropy $H(\varepsilon)$ of $A$ is given
by the logarithm $H(\varepsilon) = \ln N(\varepsilon)$. The entropy depends on the metric $d$
and on $A$. We therefore denote the local $\varepsilon$-entropy of $A \cap \{a : d(a, a_0) \le D\}$
by $H_{A,d}(\varepsilon, D)$. We will require a condition on the local entropy of the nuisance
parameter set only, that is $A = \mathcal{X}^{(n)}$ and $d(\xi, \xi') = |\xi - \xi'|_{w_2}$. For
abbreviation we write $H_{\mathcal{X}^{(n)}, |\cdot|_{w_2}}(\varepsilon, D) = H_\xi(\varepsilon, D)$. The entropy condition
we need is:
Ent  For all $\delta, \eta_n > 0$, with $\delta \eta_n \sqrt{w_{\max}} > 1$

$$\lim_{L \to \infty} \frac{\int_0^1 \sqrt{H_\xi(Lu\delta, L\delta D)}\, du}{L \delta \eta_n \sqrt{w_{\max}}} = 0. \qquad (35)$$


The following lemma is an application of a modified result of van de Geer 


(1990), which is an adaptation of the chaining method of Pollard (1984), 
on page 144. 


Lemma 3 Suppose L1 with Lı, L2 with Lz, Ent with n, and 6, then there 
exist constants Lo and Co = C (Lı, L2), such that for all L > Lo and all 
n > no 


JAC, (0) - ACn (0)| 
Peo go sup n2 9 
eca(7s) E — Elin + {16 — Bll 


= vont < exp (-CoL?n 6° wimax) 
(36) 


where On (€) = ((,8) a alae ee b'l” 2 e?) and AC,,(9), ACn (9) 
given in (26) and (27). 


Proof: We will apply a small modified version of Lemma 3.4. of van 
de Geer (1990), with A = O° and the semimetric d given in (32) and 
Zn (A) from (28) with Zn (A?) = 0. The used modified version is: Under 
the entropy condition on O° with respect to the metric d, that is for all 
6’, ”n > 0, with ó'n > 1 


1 / Hee. (uL6', Lé'D)du 
Jo Al ) =0 (37) 


= a B 
it holds 
|Zn (9)| 2N 2e 
Poal sop 2 l en semon ar e G8) 
ni (a d? (8, 0°) ( 


For 6’ = 6,/Wmax from (38) follows the result. The difference to the lemma 
of van de Geer is that we have y/n = ny. The proof of this modification has 
the same steps, but we have to change ,/n to n, in the entropy condition 
and in the exponential rate. Her assumptions on Zn (A) are not needed 
in the L,-context, since the main property she used in the proof is (33), 
that the process is sub-Gaussian. It remains to check the entropy condition 
on 0° (37). For Cartesian products A = A; x Ag with a = (aj,a2) and 
då (a,a’) < då, (a1, 04) + då, (a2, ay) , we know the following inequality: 


€ € 
Had, (e, D) < Ay da, (5.2) z0 Ay da, (5.2) ; (39) 
It can be derived in a similar way as the formula (7) in Lorentz (1966), on
page 152. Here we have A = 1) x B and d? (41,04) = Wmax |Ê — é"|? 


and då (a2,a3) = Wmax ||8 — BI", 
A, da, (€V Wmax, Dy Wmax) = Hx 11, (€, D) e 


B is a set of fixed dimension p, therefore the local entropy is bounded: 


2D 
Ay day (ey Wmax, Dy Wmax) = Hg. (€, D) < pln (Zve) . 


wW? ? 


Thus it suffices to require the entropy condition (37) for H4; a a, With 6 = 
6,/Wmax Only, that is, assume Ent. O 


4.2 The separation condition


The aim of this subsection is to verify the separation condition (21) for the
$L_1$-contrast in (27). First we quote a result by Oberhofer (1982) in the
form given by van de Geer (1990).


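Lemma 4 Suppose $\varepsilon$ is a r.v. whose distribution satisfies E with constants $D_0$
and $\kappa_\varepsilon$; then for all $|\Delta| \le \Delta_{\max}$

$$\frac{\kappa_\varepsilon D_0}{\Delta_{\max}}\, \Delta^2 \le E\left(|\varepsilon + \Delta| - |\varepsilon|\right) \le |\Delta|. \qquad (40)$$

Proof: This is the i.i.d. version of Lemma 4.2 of van de Geer (1990). □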
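
To make the shape of such a two-sided bound concrete, the sketch below evaluates $E(|\varepsilon + \Delta| - |\varepsilon|)$ in closed form for a standard double sided exponential (Laplace) error, where it equals $|\Delta| + e^{-|\Delta|} - 1$, and checks a quadratic lower bound and the linear upper bound $|\Delta|$ on a grid. The choice of error distribution, the value of `delta_max` and the constant `c` are assumptions for illustration only; `c` is not the constant of the lemma.

```python
import math

def laplace_excess(delta):
    """E(|eps + delta| - |eps|) for eps ~ Laplace(0, 1), in closed form."""
    d = abs(delta)
    return d + math.exp(-d) - 1.0

delta_max = 3.0
# The ratio excess / delta^2 decreases in |delta|, so its value at delta_max
# gives a valid quadratic constant on the whole interval [-delta_max, delta_max].
c = laplace_excess(delta_max) / delta_max ** 2

deltas = [k * delta_max / 300 for k in range(-300, 301)]
bounds_hold = all(
    c * d * d - 1e-12 <= laplace_excess(d) <= abs(d) + 1e-12 for d in deltas
)
print(c, bounds_hold)
```

The lower bound is quadratic near zero and the upper bound is linear, which is why the expected $L_1$ contrast cannot simply be identified with an $L_1$ or an $L_2$ distance.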
Applying Lemma 4 to the $L_1$-contrast with

$$\Delta_{\max} = \max_i \max_{\xi} \max_{\beta \in B^c} \left\{ |g(\xi_i^0, \beta^0) - g(\xi_i, \beta)| + |h(\xi_i^0, \beta^0) - h(\xi_i, \beta)| \right\},$$

we obtain

$$C_n(\theta) - C_n(\theta^0) \ge \frac{\kappa_\varepsilon D_0}{\Delta_{\max}}\, D_n(\xi, \beta), \qquad (41)$$

with

$$D_n(\xi, \beta) = \left|g(\xi^0, \beta^0) - g(\xi, \beta)\right|_{w_1}^2 + \left|h(\xi^0, \beta^0) - h(\xi, \beta)\right|_{w_2}^2. \qquad (42)$$

Now we need separation conditions on $g$ and $h$ also, such that it is possible
to estimate

$$D_n(\xi, \beta) \ge \rho\left(d(\theta, \theta_{(n)})\right) \qquad (43)$$

for an appropriately chosen metric $d$ on the parameter space. Deriving
(43) means solving the identification problem in semiparametric models.
This problem occurs in the same way in the $L_2$-norm theory. This is
the main reason why we are not able to obtain a nice consistency result
for the parameter of interest $\beta$ in the general setting (1), (2). From now on
we consider the models separately. We say a regression function $g$ fulfills
the contrast condition Con iff

Con  $\exists n_0\ \forall n > n_0\ \exists a_n,\ 0 < a_n < \infty,\ \forall \xi \in \mathcal{X}^{(n)}\ \forall \beta, \beta' \in B^c$

$$\left|g(\xi, \beta) - g(\xi, \beta')\right|_{w_1} \ge a_n \|\beta - \beta'\|.$$

Under Con for $g$ we have in the nonlinear regression model (5) that

$$D_n(\xi, \beta) = \left|g(\xi^0, \beta) - g(\xi^0, \beta^0)\right|_{w}^2 \ge a_n^2 \|\beta - \beta^0\|^2. \qquad (44)$$
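In the nonlinear semiparametric model (13) the identification condition ID
implies

$$2\left(m(z, \beta) - m(z, \beta^0),\ f - f^0\right)_w = 0,$$

and under Con for $m$ we have here also

$$D_n(\xi, \beta) = \left|m(z, \beta) - m(z, \beta^0) + f - f^0\right|_w^2 = \left|m(z, \beta) - m(z, \beta^0)\right|_w^2 + \left|f - f^0\right|_w^2 \ge a_n^2 \|\beta - \beta^0\|^2. \qquad (45)$$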


For the nonlinear functional relation model (6), (7) the following lemma 
helps to solve the identification problem. This result is strongly related to 
the Lemma 1 in Zwanzig (1990). Define 


En (€, 8) = |a €) — o (€.8)|,, +|9 (E8) = 9 (56°) p, + [6-2], 
(46) 


Lemma 5 Under L1, Ino IT > OVn > no VE, £9 € (xŒ) VB, B° € Be 
such that 


oE) -o (E) jE- >r In&B). ar) 


Proof: Inside of this proof let us use the abbreviations g (2 6°) = g” and 
g (2°, B) = g?. By adding +g? in |g = ral , we obtain 


Dn (§; 6) = Ln (€, 8) Q — 2An (E, 8)) (48) 


with 
Ca mane a 20) a 


Se) = a ee 


(49) 
for $L_n(\xi, \beta) > 0$ and $A_n(\xi, \beta) = 0$ otherwise. It remains to show that
there exists a constant T > 0 such that 


ji 


sup sup A, (é, b) < =—T. (50) 
€e(¥(m))° Beor 2 


ee 
Let c = G with Lı from (30) and 7; such that 0 < 7 < L, We will 


distinguish two cases: 


2 


2 2 
w2 w2 wi 


. 2 ee 
Dese <el?-o™| it) [6-29], > elgg 
i) We apply the Cauchy-Schwarz inequality and the assumption (30) 
0 2 1090 0012 0/2 |.0 _ 002 
|g” = G19 | ERC Ey nd: TG: 
A 2 < Wy w1 < wW w1 


with Ln = Ln (£, 8) given in (46). Note |g? — g% < Ln. Under i) we 
have 


0 002 \? 9 
An (£, 8)? < Lic (eo < Lc < (5-7) A 


n 


and thus in case i) (50) follows. Consider case ii). Because of 


2 
o oo) _/0_ 
0 < |(®-9)- (9-9), 
2 2 
= 0° - g”! + 9° - g| -2 (9 -g”, g -9) , 
w1 wi wi 
we have 
2 2 
2 (g =g, g -g) <P- +P- 
w1 w1 w1 
Using this and (46) we obtain for An = An (£, 8) 
2 2 012 
2A, < ar a + |9? = ghin Sie e P 
oa De Ln 


From the assumption (30) we get 


2 
JE — £| 
a a a a aa 
For positive a, the function f (x) = a04 İS increasing in z. Using ii) 
we have ca < x and f (x) > EIRE We obtain 2A,, < 1 — Td For 
Tə such that 27). = IOF’ 0> n> 5 one has under ii) A, < 5 — 72. 
We choose T = min (71,72) and get (50). O 
5 The consistency of the $L_1$-estimators


In this section we summarize the results above and obtain the consistency
of the $L_1$-estimators in the different submodels.


5.1 The nonlinear regression model 


Consider the model (5). Then we have for 


B= arg min Lus vi — g (&,2)| (51) 
the following strong consistency result. Set 
Gmax = max sup |g (€P, 8) — g (E2, 6°). (52) 


Theorem 1 Suppose for the error distribution E with the constants k-, Do 
and suppose for the regression function g that 


Jno ILaVn > no Jan, an > 0 YB, 8’ € B° 


a IB- 6"? < |o (8,8) -9 (€°.8’)|, <t le- eP. 63) 


Then there exists a positive constant Co such that for all L > 0 and all 


n > no 
I5- 6°|| > L) < exp (- 2 (552) ). (54) 


Peo go ( 


Wmax 


Proof: Under (53) from (44) and from (41) with (52) follows that the 
separation condition (21) is fulfilled with p (||8 — B°|\) ~ te Dota I8- Boj. 
Lemma 1 gives 


AC, (8) — AČ, (8) D,a? 
Pop [P-A] 1) < Pea ( ap BEO- y spat). 


The entropy condition Ent is satisfied for the one point set. Then the result 
(54) is a consequence of Lemma 3 with nn = Peseta o 


max Wmax 


2 
Note the result is interesting only for o 4) Wane: 
5.2 The nonlinear error-in-variables model
Consider the model (6), (7). The $L_1$-norm estimator is defined as
@=argmin min yw lyi — g (&, B)| + wai |x; — £il . (55) 


BeBe LERE 


Set 
— €)). 


Gmax = max sup sup lg (€;, 3) — g (é Ee) (56) 


* gex(n)e BEBS 


Then we have the following exponential probability inequality. 


Theorem 2 Suppose for the error distribution E with the constants ke, Do 
and suppose for the nuisance parameter set the entropy condition Ent is 
aî Ke DoT 


satisfied with nn = oo and bn. 


Suppose for the regression function g that 
Ang ILa, Lz < 00, Wn > no Fan, an >0 VG, B'E B° YE, E  X™* (57) 
az IG — B'I? < lg (6,8) -9 (E, B), < Le 6-2? 
(58) 
and |g (£, B) — g (&', Bk, S La lE - E'n - 


Then there exist positive constants Co, Lo such that for all L > Lo and all 
TL > no 


Pa po (|B -A|| > Lôn) < exp E (RE) ar) . (59) 


Proof: Without loss of generality we set a, < 1. Under (56) from Lemma 
4 and under (58) from Lemma 5 follows that the separation condition (21) 
is fulfilled with 


p (VIB -PIP +E- EB, ) = neDorg 


Lemma 1 gives 


= (p-e +e- ek) 


Gmax 


Pom ([B-6°| > Uh) < Pem ([- P+ E- EÈ, >e) 


pe Aco -Aa 1) 
< Peo go P ae en aa a ee ye 
TN \ac6 5.) gear (I8 = BI? + 1E- En) 


G max 
The entropy condition Ent is assumed above explicitly. Hence the result
; . KeDoTa 
(59) is a consequence of Lemma 3 with m = 7G. 


Consider now special cases and the unweighted case $w_{\max} = n^{-1}$.


Corollary 1 Suppose for the error distribution E and suppose for the re- 
gression function g that (57),(58) is valid with 


= < const, for alln. (60) 
a 


Suppose i) X™ defined in (9) or ii) ¥™ defined in (10). Then 
8B — B° Pgo go — a.s.. (61) 
Proof: Under (60) in the unweighted case we have nny/Wmax < consty/n. 
i) For Y™) defined in (8), the entropy is H¢ (6, D) < const4 Int (+) and 
the entropy condition Ent is satisfied with 6 = n73 (In n)? , (see Example 


2.1 of van de Geer (1990)). From Theorem 2 follows that there exists a d, 
0<d<1,foralle>0 


` Peo go (||8 — B| > e) < 5> exp (—né € const) (62) 
n=1 n=l 


< const (ng) > n~? < oo. (63) 


n=no 
We obtain the statement by the Borel-Cantelli lemma.


ii) For $\mathcal{X}^{(n)}$ defined in (10) the entropy can be derived from the classical
result of Kolmogorov and Tichomirov (1960) for the sup-norm $|f - f^0|_{\sup} = \max_{x \in [0,1]} |f(x) - f^0(x)|$:


aba 
m+a 


Ax) I. (6, D) < const ($) 


Since under the design assumption D max; |z; — zi—1| < cg for n > no and 
since 


E-E < max|f e -P e) s E- a 
< max|f (z) — f° (a)| + 2Lca 
we have 4") C{f :|f—f%|,,, < 3(C + L)}. Thus 
1\ mre 
He (6, D) < const (3) . (64) 


Then the entropy condition Ent is fulfilled for 6 = n~° with b = Gat > 
0 and the result (61) follows by the same arguments as in (62). O 
5.3 The nonlinear semiparametric model


Consider the model (13) with design condition D and identification condition
ID. Then we can derive, in the same way as above, the strong consistency
of the $L_1$-estimator


n 
= arg min min Wi (Yi — i — ois 
B= are o Do aa ~ F(a) 
Theorem 3 Suppose for the error distribution E and suppose for the func- 
tion m that 


Jno 3L < œ, Ja > 0 Yn > no Y8, B' € BS 


a? |B — B'|? < |m (2,6) -m (2,8) o, < L2 8-6? (68) 
and 
Gmax = Max sup lm (z;,3) —m (zi, B°)| < const. (66) 
i BeBe 


Then B 
6) => pe Peo go — a.s. (67) 


Proof: Under ID and (65) from (41) and (45) the separation condition 
(21) of Lemma 1 is satisfied with Amax = Gmax + 2C, where Gmax from 
(66) and where C from (11), 


2 
p (le -Al = ebeg e-e. 


Because of (64) for n./Wmax < const,/n and for 6 = n~? with b = 


aa > 0 the entropy condition Ent of Lemma 3 is valid. Then from 
both lemmata follows an inequality of type (62) and we obtain (67) by the 


same arguments as in (62). O 
Abstract: This paper is a comparison of two methods for computing
$L_1$ estimates of the parameter vector $\beta$ in the linear model. The main
methods in the comparison are in two groups: special purpose linear
programming (LP) methods which exploit the structure of the objective
function and iteratively re-weighted least squares (IRLS). The special
purpose LP methods included in the review are: (i) the Barrodale and
Roberts (BR) algorithm and (ii) the modified form due to Bloomfield and
Steiger (BS). The IRLS method is a new development which exploits the
piecewise differentiability of the objective function and which avoids the
difficulties previously associated with least squares based schemes. All
algorithms have been implemented in a common language, in order to
provide a better basis for comparison. To summarise: we found that our
implementations of the BR and BS algorithms are generally quicker than
existing implementations and general purpose LP solvers; the new IRLS
algorithm is faster in circumstances where the number of observations is
very large relative to the number of parameters to be estimated.


Key words: Regression, linear model, minimum absolute deviations, LP 
solvers, iteratively re-weighted least squares, piecewise differentiability. 


AMS subject classification: 65J05, 62G05. 


1 Introduction 


The method of minimum absolute deviation (MAD) or $L_1$ estimation, to
give it one of the many names by which the technique is known, is a robust
method for estimating the parameters in the linear model:

$$y = X\beta + \varepsilon.$$

The objective function to be minimised is:

$$f(\beta) = \sum_{i=1}^n |y_i - \beta^T x_i| \qquad (1)$$

where $\beta \in \mathbb{R}^k$. Minimising $f(\beta)$ will also give maximum likelihood
estimators of $\beta$ when the $\{y_i\}$ are a random sample from the double sided
exponential distribution. In this paper it is assumed that the function is
not degenerate, in which case it will possess a unique minimiser, $\beta^*$ say, at
which up to $k$ of the residuals:

$$r_i = y_i - \beta^{*T} x_i \qquad (2)$$

will satisfy:

$$r_i = 0, \quad i \in B, \text{ say} \qquad (3)$$

The set $B$ defines a set of basis vectors $x_i$ which span $\mathbb{R}^k$. If the subsets of
$\{y_i\}$ and $\{x_i\}$ defined by $B$ are denoted by $y^*$ and $X^*$, then $\beta^*$ satisfies:

$$X^* \beta^* = y^* \qquad (4)$$
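
The basis property in (3) and (4) can be illustrated by brute force for a straight-line fit ($k = 2$): since a minimiser can always be found among the lines that interpolate two data points, it suffices to try every pair. The sketch below does exactly that on made-up data; it is meant only to illustrate the property and is far too slow to be one of the algorithms compared in this paper.

```python
from itertools import combinations

def l1_objective(b0, b1, xs, ys):
    """Sum of absolute residuals for the line y = b0 + b1 * x."""
    return sum(abs(y - (b0 + b1 * x)) for x, y in zip(xs, ys))

def mad_line_by_enumeration(xs, ys):
    """Try every line through two data points and keep the best L1 fit."""
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # these two points do not determine a slope
        b1 = (ys[j] - ys[i]) / (xs[j] - xs[i])
        b0 = ys[i] - b1 * xs[i]
        value = l1_objective(b0, b1, xs, ys)
        if best is None or value < best[0]:
            best = (value, b0, b1, (i, j))
    return best

# Illustrative data: roughly y = x, with one gross outlier at the end.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.9, 4.1, 40.0]
value, b0, b1, basis = mad_line_by_enumeration(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(basis, b0, b1, value)
```

The two observations in `basis` have zero residual (up to rounding), and the outlier is not among them.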


There are two general types of algorithm for calculating MAD estimates
of the parameters of a linear model. The first type relies on the fact that
the objective function $f(\beta)$ at equation (1) can be formulated as a linear
program, Charnes et al (1955). This type of method includes a procedure
due originally to Barrodale and Roberts (1973) [henceforth BR] which
exploits the fact that the MAD objective function may be written as an LP
with special structure. Other LP methods, which also exploit the special
structure, have been reported by Bloomfield & Steiger (1980) and Seneta &
Steiger (1984). The second type of procedure uses iteratively re-weighted
least squares (IRLS). This method was reported by Schlossmacher (1973)
and Fair (1974). According to Bloomfield and Steiger (1984, page 259)
[henceforth BS], however, it was due originally to Beaton & Tukey (1974).
Comparative studies are reported in BS. There is a comprehensive review
of algorithms in Dielman (1992).

One of the motivations for this paper is that, although there are very
strong similarities between the special purpose LP solvers mentioned above,
comparisons are limited by the fact that, to date and to the best of our
knowledge, the software implementations are quite distinct. Self-evidently,
comparison is greatly facilitated if the software is written in the same language
and meets similar design criteria. A second motivation is our wish to
develop algorithms to minimise mixed objective functions of the form:


n m 
f(B) =a) u= a Faaa - 6X]. (5) 

i=l j=l 
These objective functions arise in other robust methods and in dynamic
estimation schemes in which the current parameter estimates are (approximately
or asymptotically) normally distributed. Robust estimation methods
using similar convex objective functions, in which there is a modulus
term, have been studied by Dodge & Jureckova (1991 and 1992) and related
computational aspects are reported in Dodge et al (1991). Objective functions
of the form in (5) also arise in portfolio optimisation when the conventional
quadratic programming formulation is extended by the inclusion of
transactions costs. In the case of portfolio optimisation, the minimisation
of $f(\beta)$ is invariably carried out subject to a number of linear inequality
constraints on the values of the parameter vector $\beta$. Algorithms to minimise
$f(\beta)$ given at equation (5) may use IRLS methods; see Adcock & Meade
(1995) for an example of IRLS used in portfolio optimisation. However, the
well reported deficiencies of IRLS have prompted us to consider the general
question of algorithms for MAD estimation ab initio.

The purpose of this paper is therefore to compare the solution times
of the main established special purpose LP and IRLS algorithms for MAD
estimation. We use new implementations of the Barrodale and Roberts
and the Bloomfield and Steiger algorithms. We also present a new proce- 
dure for IRLS. The algorithm that we describe in Section 3 of this paper is 
different from the scheme developed by Schlossmacher and others in that 
it exploits the piecewise differentiability of the objective function. All al- 
gorithms included in the comparison have been re-implemented in a single 
programming language and in a similar programming style. The aim is to 
provide a fair basis for comparison of the solution times. In addition, and 
as reported below, we have found that our new code offers performance 
improvements over existing software. 

The structure of the paper is as follows. Section 2 describes methods 
based on special purpose linear programming methods. Section 3 presents 
our scheme which uses iteratively re-weighted least squares. We compare 
performance using a number of different data sets in Section 4. The final 
section of the paper contains a summary and concluding remarks. 


2 Linear programming methods 


Following Charnes et al (1955), the objective function at (1) may be written
exactly as:

$$f(\beta) = \sum_{i=1}^n |y_i - \beta^T x_i| = \sum_{i=1}^n \left(e_i^+ + e_i^-\right) \qquad (6)$$

where:

$$e_i^+,\ e_i^- \ge 0 \qquad (7)$$

and where the corresponding $n$-vectors $e^+$ and $e^-$ satisfy:

$$y - X\beta = e^+ - e^- \qquad (8)$$

Minimisation of $f(\beta)$ is now a linear programming problem involving the
$2n + k$ variables $e^+$, $e^-$ and $\beta$, together with the $n$ equality constraints at (8)
and the $2n$ non-negativity constraints at (7). Calculation of the minimiser
$\beta^*$ may be undertaken using standard LP solvers. However, these methods
are very slow when compared with special purpose solvers which can exploit
the structure of the LP formulation of the objective function. The following
results are summarised from BS, who provide further details and proofs of
the properties of the algorithms.
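
The reformulation in (6) to (8) can be checked numerically. The sketch below splits the residuals of an arbitrary candidate $\beta$ into non-negative parts, stacks the LP variables in the (assumed) order $(\beta, e^+, e^-)$, and confirms that the linear objective reproduces the sum of absolute residuals. The data and the helper names are made up for illustration, and no LP solver is invoked.

```python
def split_residuals(X, y, beta):
    """Return (e_plus, e_minus), both >= 0, with y - X beta = e_plus - e_minus."""
    e_plus, e_minus = [], []
    for row, yi in zip(X, y):
        r = yi - sum(b * x for b, x in zip(beta, row))
        e_plus.append(max(r, 0.0))
        e_minus.append(max(-r, 0.0))
    return e_plus, e_minus

def lp_cost_vector(n, k):
    """Cost vector for variables ordered as (beta, e_plus, e_minus)."""
    return [0.0] * k + [1.0] * (2 * n)

# Illustrative data: intercept plus one regressor, and a candidate beta.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.5, 0.8, 2.4, 2.9]
beta = [0.2, 1.0]

e_plus, e_minus = split_residuals(X, y, beta)
z = beta + e_plus + e_minus                     # 2n + k LP variables
cost = lp_cost_vector(len(y), len(beta))
lp_value = sum(c * v for c, v in zip(cost, z))
mad_value = sum(
    abs(yi - sum(b * x for b, x in zip(beta, row))) for row, yi in zip(X, y)
)
print(lp_value, mad_value)
```

For any fixed $\beta$ this split is the cheapest feasible choice of $e^+$ and $e^-$, since at most one of each pair is non-zero; minimising over $\beta$ as well is then the job of the LP solver.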


2.1 The Barrodale and Roberts (BR) algorithm 


The BR algorithm may be viewed as row and column operations on an
$(n+k) \times (k+1)$ matrix $A$. The initial value of $A$ is:

$$A = \begin{bmatrix} X & y \\ I & 0 \end{bmatrix}$$

In the steps below the elements of $A$ are $\{a_{ji}\}$. Note that in this notation $i$
indexes columns and $j$ rows rather than the more conventional arrangement.


Step 1 compute:

$$g_i = \sum_j |a_{ji}|, \quad j \text{ over } a_{j,k+1} = 0$$

$$h_i = \sum_j a_{ji} \cdot \operatorname{sign}(a_{j,k+1})$$

$$l_i = \min(g_i - h_i,\ g_i + h_i)$$

where: $i = 1(1)k$ and $j = 1(1)n$.

Step 2 determine:

$$p = I \quad\text{where}\quad l_I = \min_i(l_i)$$

If $l_I > 0$ then go to Step 5.

Step 3 determine the pivot row $q$ by finding the MAD estimate of $t$ in:

$$f = \sum_j |a_{j,k+1} - t\, a_{jp}|$$

i.e., find $t^* = y_q / x_{qp}$. It should be noted that this requires a sort routine
and that a suitable choice of sorter can affect algorithm timings.

Step 4 pivot on row $q$, column $p$, i.e. compute new columns $a'_j$ of $A$:

$$a'_p = a_p / a_{qp}$$

$$a'_j = a_j - a_{qj}\, a'_p, \quad j \ne p$$

and go to Step 1.

Step 5 The minimiser $\beta^*$ may be recovered from the block of $A$ corresponding
to the initial zero vector, i.e. in column $k + 1$, rows $n + 1$ through
$n + k$, together with a sign reversal. Specifically: $\beta^*_i = -a_{n+i,k+1}$.
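
The one-dimensional MAD problem in Step 3 has a well-known solution: the minimiser of $\sum_j |b_j - t a_j|$ is a weighted median of the ratios $b_j / a_j$ with weights $|a_j|$, which is why a sort is required. The following sketch is one simple way to compute it, with $b_j$ playing the role of $a_{j,k+1}$ and $a_j$ that of $a_{jp}$; the function name and the use of a full sort are choices made for this illustration, not a description of the sorter used in the implementations compared here.

```python
def mad_ratio(b, a):
    """Minimise sum_j |b[j] - t * a[j]| over t.

    Returns (t_star, q), where q is the row at which the minimum is attained.
    Rows with a[j] == 0 only add a constant to the objective and are skipped.
    """
    candidates = sorted(
        (b[j] / a[j], abs(a[j]), j) for j in range(len(a)) if a[j] != 0.0
    )
    total = sum(weight for _, weight, _ in candidates)
    running = 0.0
    for ratio, weight, j in candidates:
        running += weight
        if running >= 0.5 * total:   # first ratio reaching half the total weight
            return ratio, j
    raise ValueError("all entries of a are zero")

# With unit weights the weighted median reduces to the ordinary median.
t_star, q = mad_ratio([1.0, 2.0, 3.0, 4.0, 100.0], [1.0, 1.0, 1.0, 1.0, 1.0])
print(t_star, q)
```

A full sort costs $O(n \log n)$ per pivot; selection-based alternatives can reduce this, which is consistent with the remark that the choice of sorter affects timings.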


2.2 Bloomfield and Steiger (BS) algorithm 


There are two modifications to the BR algorithm which are described in BS.
One is due to Bloomfield & Steiger (1980) themselves. It is essentially
the same as BR except that in Step 1 a heuristic is used to compute $g_i$ and
$h_i$. In an obvious notation:

$$g_i^{BS} = g_i \Big/ \sum_j |a_{ji}|, \qquad h_i^{BS} = h_i \Big/ \sum_j |a_{ji}|.$$

According to Bloomfield & Steiger, the BS algorithm often converges more
quickly than the original BR procedure.


3 Algorithms based on iteratively reweighted 
least squares (IRLS) 


Iteratively re-weighted least squares (IRLS) is an alternative approach to
LP. The usual approach is to write $f(\beta)$ identically as:

$$f(\beta) = \sum_{i=1}^n (y_i - \beta^T x_i)^2 / |y_i - \beta^T x_i| \qquad (9)$$

If $\beta_p$ is an approximation to $\beta^*$, a new approximation is computed by
minimising the sum of squares:

$$f(\beta) = \sum_{i=1}^n (y_i - \beta^T x_i)^2 / |y_i - \beta_p^T x_i| \qquad (10)$$

which is equivalent to differentiating (9) while holding the $|y_i - \beta^T x_i|$ terms
in the denominator fixed. In conventional OLS matrix notation, the equation
for the new approximation $\beta_{p+1}$ may be written as:

$$X^T W_p X \beta_{p+1} = X^T W_p y \qquad (11)$$

where:

$$W_p = \operatorname{Diag}\{1/|y_i - \beta_p^T x_i|\} \qquad (12)$$

Since it is known that, at the minimiser $\beta^*$, up to $k$ of the residuals $y_i - \beta^{*T} x_i$
will equal zero, this procedure requires some modification or it will fail as
the elements of $W_p$ become very large. A common modification is to define:

$$W_p = \operatorname{Diag}\{W_{pi}\}$$

where:

$$W_{pi} = \begin{cases} 0 & \text{if } y_i - \beta_p^T x_i = 0 \\ 1/|y_i - \beta_p^T x_i| & \text{otherwise} \end{cases} \qquad (13)$$


This algorithm is not trouble-free in practice; it is criticised
in BS on the grounds that it is slow and prone to instability. A
modification to the basic IRLS scheme was introduced by Adcock & Meade
(1995). They note that at the points at which it exists, the vector of partial
derivatives of $f(\beta)$ is:

$$f'(\beta) = 2X^T W X \beta - 2X^T W y - \Big(\sum_A x_i - \sum_B x_i\Big) \qquad (14)$$

where:

$$W = \operatorname{Diag}\{1/|y_i - \beta^T x_i|\} = \operatorname{Diag}\{W_i\} \qquad (15)$$

$A = \{i : (y_i - \beta^T x_i) < 0\}$ and $B = \{i : (y_i - \beta^T x_i) > 0\}$. This suggests the
iterative scheme with limiting equation:

$$\beta = (X^T W X)^{-1}\Big\{X^T W y + 0.5\Big(\sum_A x_i - \sum_B x_i\Big)\Big\} \qquad (16)$$

which may be re-arranged as:

$$\beta_{p+1} = \beta_p - 0.5\,(X^T W_p X)^{-1}\Big(\sum_A x_i - \sum_B x_i\Big) = \beta_p - 0.5\,\delta_p \text{ say}, \qquad (17)$$

where $\delta_p$ is the step length at iteration $p$. For practical purposes, the
modification of $W_p$ described at (13) is employed. The process terminates
when the absolute change in the value of the objective function is less than a
given tolerance and the absolute change in each estimated parameter value
is also less than a set tolerance.
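
A minimal sketch of the iteration at (17), with the weight modification (13) and the two-part termination rule just described, might look as follows. It assumes NumPy, a least squares starting value and a single tolerance for both stopping tests; all three are choices made for this illustration, and the code is not the Fortran implementation used in the comparison. Convergence is not guaranteed in general.

```python
import numpy as np

def irls_mad(X, y, tol=1e-10, max_iter=1000):
    """Iterate beta <- beta - 0.5 * (X'WX)^{-1} (sum_A x_i - sum_B x_i), cf. (17)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # assumed start: least squares
    f_old = np.abs(y - X @ beta).sum()
    for _ in range(max_iter):
        r = y - X @ beta
        nonzero = r != 0.0
        w = np.zeros_like(r)
        w[nonzero] = 1.0 / np.abs(r[nonzero])      # zero weight for zero residuals, cf. (13)
        # Set A collects negative residuals, set B positive residuals.
        direction = X[r < 0].sum(axis=0) - X[r > 0].sum(axis=0)
        delta = np.linalg.solve(X.T @ (w[:, None] * X), direction)
        beta_new = beta - 0.5 * delta
        f_new = np.abs(y - X @ beta_new).sum()
        done = abs(f_new - f_old) < tol and np.max(np.abs(beta_new - beta)) < tol
        beta, f_old = beta_new, f_new
        if done:
            break
    return beta

# Location model: the MAD estimate of a constant is the sample median.
beta_hat = irls_mad(np.ones((5, 1)), [1.0, 2.0, 3.0, 4.0, 100.0])
print(beta_hat)
```

In the location example the iterates start at the sample mean and settle at the sample median, where the sets $A$ and $B$ balance.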

To improve convergence, we also employ a number of empirical modifications
of the scheme. When the process described above converges, to
$\beta'$ say, the label set corresponding to the $k$ smallest absolute values of the
residuals $|y_i - \beta'^T x_i|$ is used to define a basis $B^*$. If the subsets of $\{y_i\}$
and $\{x_i\}$ defined by $B^*$ are collectively denoted by $y^*$ and $X^*$, then the
minimiser $\beta^*$ is computed as the solution to:

$$X^* \beta^* = y^*$$

as long as $f(\beta^*) \le f(\beta')$. Otherwise the minimiser is taken as $\beta'$. It should
be noted that if $f(\beta^*) > f(\beta')$ then $\beta^*$ cannot be the minimiser, which
implies that $\beta'$ is not the minimiser either. However, for the data sets
described below, this algorithm always converged to the correct solution.
That is, the solution computed by the IRLS method described above was
always the same as that computed by the BR or BS algorithms. In this
context "the same" is taken to mean an accuracy equal to or better than the
process termination parameters.


4 A comparison of performance 


In order to compare the computational efficiency of the MAD algorithms
described, the times taken by the algorithms to solve a range of problems
were measured. If a general data set is denoted as $\{y_i; x_{ij}; i = 1(1)n,
j = 1(1)k\}$ then the test data were generated by the following procedure.


1. The $x_{i1}$ are sampled from a uniform distribution on the interval
(-1000, 1000), for $i = 1(1)n$.

2. The remaining $x_{ij}$, $j = 2(1)k$, are generated by the equation:

$$x_{ij} = u_{ij} + c_j x_{i1}$$

where the $u_{ij} \sim U(-1000, 1000)$ for $i = 1(1)n$, $j = 2(1)k$ and where the
constants $c_j$ were set to 2.0, $j = 2(1)k$, to control the collinearity between
the $x_{ij}$.

3. The dependent variables were generated by the equation:

$$y_i = \sum_{j=1}^k \beta_j x_{ij} + \varepsilon_i$$

where the true coefficients $\beta_j \sim U(-1, 1)$.
MAD estimation corresponds with the maximum likelihood estimation
of the parameters $\{\beta_j\}$ where the residuals $\varepsilon_i$ follow a double sided
exponential distribution. BS use both the Gaussian and the Pareto in their
algorithm comparisons. This led us to generate three different data sets
where the values of the error term $\varepsilon_i$ are sampled from three different
distributions, namely the double sided exponential, the Gaussian and the
Pareto with appropriate parametrisation. In order to standardise the error
distributions, the interquartile range for each distribution was set to
(-50, 50). Data sets were generated with the combinations of numbers of
observations and numbers of variables shown in Table 1. For each error
distribution, four sets of observations were generated for each combination
marked with a tick, giving 480 = 3 * 4 * 40 data sets in all.
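
The generation procedure can be sketched as follows for the double sided exponential case. The sketch reads "interquartile range set to (-50, 50)" as placing the lower and upper quartiles at -50 and +50, which for a Laplace distribution centred at zero means a scale of $50 / \ln 2$; that reading, the function names, the seed handling and the use of Python's standard `random` module are assumptions made for this illustration, not details of the original Fortran code.

```python
import math
import random

def laplace_quantile(p, scale):
    """Inverse CDF of a double sided exponential distribution centred at 0."""
    if p < 0.5:
        return scale * math.log(2.0 * p)
    return -scale * math.log(2.0 * (1.0 - p))

# Scale chosen so that the quartiles sit at -50 and +50 (assumed reading).
LAPLACE_SCALE = 50.0 / math.log(2.0)

def make_data_set(n, k, seed=0, c=2.0):
    """Generate one data set following steps 1 to 3 with Laplace errors."""
    rng = random.Random(seed)
    beta = [rng.uniform(-1.0, 1.0) for _ in range(k)]          # true coefficients
    X, y, eps = [], [], []
    for _ in range(n):
        x1 = rng.uniform(-1000.0, 1000.0)                      # step 1
        row = [x1] + [rng.uniform(-1000.0, 1000.0) + c * x1    # step 2
                      for _ in range(k - 1)]
        p = min(max(rng.random(), 1e-12), 1.0 - 1e-12)         # keep the log finite
        e = laplace_quantile(p, LAPLACE_SCALE)
        X.append(row)
        eps.append(e)
        y.append(sum(b * x for b, x in zip(beta, row)) + e)    # step 3
    return X, y, beta, eps

X, y, beta, eps = make_data_set(n=50, k=5, seed=1)
print(len(X), len(X[0]), len(y))
```

Gaussian and Pareto errors could be handled by swapping in the corresponding quantile function with a matching rescaling.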


Table 1: Data sets generated


                             Number of observations
Number of
variables    10    20    50   100   200   500  1000  2000  5000 10000
     1        *     *     *     *     *     *     *     *     *     *
     2        *     *     *     *     *     *     *     *     *     *
     5                    *     *     *     *     *     *     *     *
    10                          *     *     *     *     *     *     *
    20                                      *     *     *     *     *


The BR algorithm and the iteratively re-weighted least squares (IRLS) 
algorithm were used to estimate the parameters of each data set. Both 
algorithms were programmed in Fortran 77. We implemented a new version 
of the BR algorithm. This follows the procedure described in Section 2. The 
computations were performed on a Silicon Graphics workstation. Since the
solution times for small problems were very short, the time actually measured
was that needed to solve the same problem ten times.

The convergence parameters in the IRLS algorithm were set so that the
algorithm was deemed to have converged if:

   $f(\beta_{p-1})/f(\beta_p) - 1 < 10^{-[\ldots]}$ and $\sum_{j=1}^{k} |\beta_{j,p} - \beta_{j,p-1}| \le 10^{-[\ldots]}$.


It is well known that the values of these tolerances can affect solution
time substantially. The above values were chosen after some initial
investigation with the aim of ensuring that, for those data sets where the new
IRLS method converged satisfactorily, the numerical discrepancies between
IRLS and LP solutions were small. As already reported in Section 3, for
the data sets considered, all three methods always converged to the same
solution.

In order to gain some understanding from these timings, a model of
the computation time, T say, as a function of the number of variables, k
say, and the number of observations, n say, was constructed. Anderson &
Steiger (1982) proposed a model of the form:

   $T = \gamma_0 + \gamma_1 n + \gamma_2 k + \gamma_3 nk + \eta$

Trials with this model were unsatisfactory in that the estimated values of
T were negative for small values of k. Beasley (1990) suggested a log-linear
model for the timing of the LP solutions. This provided a basis for the
following model:

   $\ln(T) = \gamma_0 + \gamma_1 \ln(k) + \gamma_2 \ln(n) + \gamma_3 \ln(\ln(n)) + \eta'$

The coefficients $\gamma_i$ were estimated by minimising $\sum |\eta'|$. The estimated
coefficients for the BR algorithm and for IRLS are in Table 2, which shows
the estimated coefficients over all data sets and for each error distribution
separately.


Table 2: Estimates of parameters in timing model


Error                          Barrodale      Iterated Least
distribution   Coefficient     and Roberts    Squares

Neg. Exp.      $\gamma_0$        -10.76         -9.57
               $\gamma_1$          1.19          1.58
               $\gamma_2$          2.32          1.29
               $\gamma_3$         -3.96         -0.89
               mean $|\eta'|$      0.20          0.38

Gaussian       $\gamma_0$        -10.68         -9.51
               $\gamma_1$          1.18          1.58
               $\gamma_2$          2.38          1.17
               $\gamma_3$         -4.21         -0.55
               mean $|\eta'|$      0.19          0.35

Pareto         $\gamma_0$        -10.73         -8.59
               $\gamma_1$          1.19          1.74
               $\gamma_2$          2.35          1.54
               $\gamma_3$         -4.15         -2.15
               mean $|\eta'|$      0.19          0.31

Combined       $\gamma_0$        -10.70         -9.30
               $\gamma_1$          1.19          1.56
               $\gamma_2$          2.34          1.30
               $\gamma_3$         -4.08         -1.02
               mean $|\eta'|$      0.20          0.42

Comparison of the parameters for the time taken by the BR algorithm
shows a consistency between the different error distributions. There is no
discernible difference in timings caused by the choice of error distribution.
The timings for IRLS are more dispersed about the fitted equation, showing
that for the same dimension of problem there is twice as much variability
in solution time, in terms of mean absolute error, as for the BR algorithm.
For a problem of the same dimensions, IRLS appears to require a greater
solution time with the Pareto distribution than with the other distributions,
by a factor of about two. Anderson & Steiger (1982) conjectured that the
timing of the BR algorithm increased with approximately $n^2$. Their
conjecture is broadly confirmed by these results, which suggest that timings
are proportional to $n^{2.34}$. The term in $\ln(\ln(n))$ adjusts this slightly
over a wide range of numbers of observations. The timing increases slightly
faster than linearly in terms of the number of variables: it is proportional
to $k^{1.19}$.

The reliance of the IRLS technique on matrix inversion means that the
computation time increases faster than linearly with k. In fact it appears
that computation time is proportional to $k^{1.56}$. However, in contrast to
the BR algorithm, the IRLS computation time increases only a little faster
than linearly in the number of observations, i.e. it is proportional to
$n^{1.30}$.

The actual times plotted along with the estimated times for the BR and 
IRLS algorithms are shown in Figures 1, 2 and 3. 


Figure 1. Times for Barrodale Roberts

Figure 2. Times for IRLS

The greater variability of the IRLS solution times is apparent. More
interestingly, the superiority of the BR algorithm decreases as the number
of observations increases. This crossover is clearly brought out in Figure
4, where the expected solution times for each algorithm are shown for one
and five variables. Thus, for one variable and about 1500 observations, the
expected solution times are the same. For a larger number of observations,
one would expect IRLS to be quicker. For five variables, the crossover
occurs at about 3300 observations. We note the results of a simulation
study of times for MAD estimation of simple linear regression reported by
Armstrong & Frome (1976), who also compared IRLS and LP methods.
Although their results cannot be compared directly with ours, they also
found a similar superiority of LP over IRLS as far as mean solution time is
concerned. However, as the table in their paper indicates, the superiority
declines as the number of observations increases.
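The crossover can be explored directly from the fitted log-linear model. The sketch below is an illustration only: it plugs the rounded "Combined" coefficients of Table 2 into the model and locates, by bisection on a logarithmic scale, the number of observations at which the two predicted times coincide. The function names and the search interval are assumptions made here. Because the tabulated coefficients are rounded, the result need not reproduce the quoted crossover points exactly, although it falls in the same range of a few thousand observations.

```python
import math

# "Combined" coefficients (gamma_0, ..., gamma_3) copied from Table 2.
COEFFICIENTS = {
    "BR": (-10.70, 1.19, 2.34, -4.08),
    "IRLS": (-9.30, 1.56, 1.30, -1.02),
}


def log_time(algorithm, n, k):
    """Fitted ln(T) for n observations and k variables."""
    g0, g1, g2, g3 = COEFFICIENTS[algorithm]
    return g0 + g1 * math.log(k) + g2 * math.log(n) + g3 * math.log(math.log(n))


def crossover(k, lo=10.0, hi=1.0e6, iterations=100):
    """Number of observations at which the fitted BR and IRLS times are equal."""
    gap = lambda n: log_time("BR", n, k) - log_time("IRLS", n, k)
    if not (gap(lo) < 0.0 < gap(hi)):
        raise ValueError("no sign change on the search interval")
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)      # bisection on the log scale
        if gap(mid) < 0.0:
            lo = mid                  # BR still faster at mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

Calling `crossover(1)` and `crossover(5)` gives the model-based crossover for one and five variables respectively.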

As indicated above, as part of this study we wrote a fresh modular
implementation of the BR algorithm. This has a built-in facility to be
converted to the Bloomfield and Steiger algorithm, which we also used to
compute solution times for the above data sets. The parameter estimates
in the above timing model for the BS algorithm are shown in Table 3.

As the table shows, there is no discernible difference between the
solution times of the BR and BS algorithms. The estimated value of $\gamma_2$
for the BS algorithm is broadly consistent with Bloomfield and Steiger's
contention that the computation time is linear in the number of
observations n. However, the more interesting coefficient is that for the
number of variables, which shows that the solution time increases less than
linearly. It is in fact proportional to $k^{0.67}$. The disadvantage with this
implementation of the BS algorithm is that it is slow for small problems.


Table 3: Estimates of parameters in timing model for BS algorithm


                  Error distribution
Coefficient       Combined

$\gamma_0$          -2.07
$\gamma_1$           0.67
$\gamma_2$           1.21
$\gamma_3$          -4.11
mean $|\eta'|$       0.44


Finally, we also used two other MAD algorithms which are in the public 
domain and computed the solution times. The first was the version of the 
BR algorithm provided in the NAG library. The second was a line by line re- 
implementation of the BR algorithm as described in Barrodale and Roberts 
(1974). We found that, with respect to the number of variables, these two 
algorithms are similar to our modular implementation of BR and BS in that 
the solution times vary linearly with the number of variables. However, 
we also found that the solution time of these public domain algorithms 
increases much more steeply with the number of observations. 

It should be noted that none of the above data sets is degenerate and
that, because of the use of the uniform distribution, there are no extreme
outliers in the independent x variables. The conditions under which
the new IRLS algorithm, or indeed BR/BS, might fail remain a topic for
further investigation. At present, we are inclined to the view that some
algorithm problems may be due in part to the use of an inappropriate
programming language.


5 Summary and concluding remarks 


In this paper, we have presented and compared two methods for MAD
estimation of the parameters in a linear model. They are described in detail,
with accompanying pseudo-code, in Adcock & Meade (1997). There is no
single algorithm that is superior to all the others, at least as far as the data
sets that we have investigated are concerned. The IRLS algorithm that we
have described in this paper did not suffer from any problems of convergence
or of numerical accuracy. Furthermore, we found it to be faster than all of
the special purpose LP algorithms for data sets where the system
is greatly over-determined. We did not find significant differences between
the BR and BS algorithms. Our investigations of various implementations
of the BR algorithm indicated that the speed of the sort routine, which is
a necessary step in each iteration of the algorithm, is crucial to the overall
time taken. According to Dielman (1992), it is also likely that the algorithm
due to Armstrong et al. (1979), which employs an LU decomposition of the
basis matrix, may lead to further performance improvements. This remains
a topic for further investigation.
Abstract: We propose two finite algorithms to compute the exact least
median of squares (LMS) estimates of parameters of a linear regression 
model with p coefficients. The first algorithm is similar to Stromberg’s 
(1993) exact algorithm. It is based on the exact fit to subsets of p cases 
and uses impossibility conditions to avoid unnecessary calculations. The 
second one is based on a branch and bound (BAB) technique. Empirical 
results suggest that the proposed algorithms are faster than the finite 
exact algorithms described earlier in the literature. 


Key words: Branch and bound, exact algorithms, high breakdown regres- 
sion, least median of squares, robust regression 
1 Introduction


In this paper we consider the multiple linear regression model

   $Y = Z\theta + e$,   (1)

where Y is an n x 1 vector of dependent variables, $\theta$ is a p x 1 vector of
unknown parameters, Z is an n x p design matrix of predictors, and e is
an n x 1 vector of true residuals. We denote the ith component of Y and
the ith row of Z by $y_i$ and $z_i^t$, respectively. We suppose that the design
matrix Z is fixed and has rank p. Sometimes we also assume that any
p x p submatrix of Z is nonsingular; in this case we say that Z verifies
the Haar condition, or that the observations are in general position. An
estimate of $\theta$, say $\hat\theta$, gives n residuals $e_i(\hat\theta) = y_i - z_i^t\hat\theta$. The most
well-known estimator of $\theta$ is the Least Squares (LS) estimator. The LS
estimator is optimal in several situations but it is severely affected by
outliers. It also suffers from the problem of masking, which occurs when a
data set contains multiple outliers and, at the same time, these outliers are
not detected by the usual LS diagnostic procedures. To get reliable outlier
detection and estimation, a high breakdown point estimator should be used.
Such an estimator is the Least Median of Squares (LMS) estimator, introduced
by Rousseeuw (1984). The LMS estimator, which we denote by $\hat\theta_{LMS}$, is
defined by

   $\hat\theta_{LMS} = \arg\min_\theta \{|e_i(\theta)|\}_{h:n}$,

where the subscript h:n denotes the hth order statistic of n numbers. To
achieve the maximum breakdown point when the observations are in general
position, the coverage h is chosen as $h = [n/2] + [(p+1)/2]$, where [ ] denotes
the greatest integer function. For h = n, the LMS estimator is equal to the
Chebyshev estimator, denoted by $\hat\theta_C$, and defined by

   $\hat\theta_C = \arg\min_\theta \{\max_i |e_i(\theta)|\}$.

The minimal value of the objective function which defines $\hat\theta_C$ will be called
the Chebyshev criterion.
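To make the definitions concrete, the following short Python sketch evaluates the LMS objective (the hth smallest absolute residual) for a candidate parameter vector, with the coverage h chosen as above. The function names `lms_coverage` and `lms_objective` are hypothetical and chosen only for this illustration. Setting h = n gives the Chebyshev objective.

```python
def lms_coverage(n, p):
    """Coverage h = [n/2] + [(p+1)/2] for n cases and p coefficients."""
    return n // 2 + (p + 1) // 2


def lms_objective(Z, y, theta, h=None):
    """hth smallest absolute residual of the fit theta; h defaults to the LMS coverage."""
    n, p = len(Z), len(theta)
    if h is None:
        h = lms_coverage(n, p)
    abs_residuals = sorted(
        abs(yi - sum(zij * tj for zij, tj in zip(zi, theta)))
        for zi, yi in zip(Z, y)
    )
    return abs_residuals[h - 1]      # order statistics are 1-based in the text
```

In the test data used below, four of five points lie exactly on a line, so the LMS objective of that line is zero while its Chebyshev objective equals the outlier's residual.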

The computation of the LMS estimate is difficult. The most widely used 
algorithm to approximate the LMS estimate is the PROGRESS algorithm 
(Rousseeuw and Leroy, 1987). This algorithm computes the exact fit of 
several elemental sets (subsets of size p) of the data set. The exact fit 
with smallest hth absolute residual gives an approximate LMS estimate. 
The elemental sets considered can be either all $\binom{n}{p}$ possible such sets, or
a random subsample of them. If the regression model has an intercept, the
intercept of each exact fit can be adjusted to yield a smaller hth absolute
residual. For a simple regression model (p = 2), Steele and Steiger (1986) 
prove that an algorithm which examines all elemental sets and adjusts the 
intercept for each of them, obtains the exact LMS estimate. However, for p 
greater than 2 such an algorithm does not necessarily yield the exact LMS 
estimate. 

Stromberg (1993) proposes an exact algorithm based on the fact that 
the exact LMS estimate is a Chebyshev estimate for some subset of size 
p+1 of the data. Any set of p + 1 indices from {1,...,n} will be called 
here a reference set. Stromberg’s algorithm examines all reference sets, 
computes the Chebyshev estimate for each of them, and then sets $\hat\theta_{LMS}$ as
the Chebyshev estimate with smallest hth absolute residual. 

In this paper we develop two algorithms to compute the exact LMS
estimate that are computationally feasible for small or moderate size samples.
Section 2 describes some properties of the Chebyshev fit and discusses its
computation. In Section 3 we propose an algorithm to find the optimal
reference set. This algorithm differs from Stromberg's in some important
features: it is based on exact fits of elemental sets; it uses accelerations
based on impossibility conditions; and it completely avoids sorting the
residuals. In Section 4 we propose a finite exact algorithm based on a branch
and bound (BAB) technique. The BAB algorithm finds the subset of size h 
whose Chebyshev estimate gives the exact LMS estimate without exhaus- 
tive enumeration of all h-subsets. We describe the basic BAB algorithm 
and some strategies to improve its efficiency. An empirical comparison of 
the LMS algorithms is described in Section 5. 


2 Chebyshev regression 


In this Section we describe some properties of the Chebyshev fit and dis- 
cuss its computation. Osborne and Watson (1967) proved that, if the design 
matrix Z of model (1) has rank p and n > p, then: First, there exists a 
Chebyshev estimate that is equal to a Chebyshev estimate for some ref- 
erence set of the data; second, the Chebyshev criterion is identical to the 
Chebyshev criterion for the optimal reference set; third, all cases of the 
optimal reference set have residuals whose absolute value is equal to the 
Chebyshev criterion, and fourth, the rank of the design matrix of the opti- 
mal reference set is p. It is well known that the Haar condition is sufficient 
(but not necessary) for uniqueness of the Chebyshev estimate. 

For a reference set, both the Chebyshev criterion and a Chebyshev
estimate can be explicitly computed. Assume that n = p + 1 and the
design matrix Z verifies the Haar condition. In this case two methods
to obtain the Chebyshev estimate are available. The first method (see,
e.g., Cheney, 1966; Meicler, 1968) is based on the LS estimate
$\hat\theta_{LS} = (Z^tZ)^{-1}Z^tY$. Let $\hat e = (\hat e_1,\ldots,\hat e_{p+1})^t$ be the LS residual
vector, and denote $s^t = (\mathrm{sgn}(\hat e_1),\ldots,\mathrm{sgn}(\hat e_{p+1}))$, where "sgn" represents
the sign function. Then the Chebyshev criterion is

   $w = \begin{cases} 0 & \text{if } s^t\hat e = 0, \\ \sum_{i=1}^{p+1} \hat e_i^2 \Big/ \sum_{i=1}^{p+1} |\hat e_i| & \text{otherwise,} \end{cases}$   (2)

and the Chebyshev estimate is

   $\hat\theta_C = \hat\theta_{LS} - w(Z^tZ)^{-1}Z^ts = (Z^tZ)^{-1}Z^t(Y - ws)$.   (3)


The second method (see, e.g., Meicler, 1969; Armstrong and Kung, 1980)
is based on the exact fit to one of p + 1 possible elemental sets. The set
{1,...,p + 1} is partitioned into an elemental set J and its complement.
Suppose (without loss of generality) that J contains the first p indices,
and let r = p + 1 be the complementary index. Denote by $Z_J$ and $Y_J$
the respective submatrices of Z and Y formed with the rows indexed by
J. As $Z_J$ is nonsingular, $\hat\theta_J = Z_J^{-1}Y_J$ is the exact fit to the elemental
set J, and $e_i = y_i - z_i^t\hat\theta_J$ is the ith residual based on the exact fit. Let
$\xi^t = (\xi_1,\ldots,\xi_p) = z_r^tZ_J^{-1}$, and $\sigma^t = -\mathrm{sgn}(e_r)(\mathrm{sgn}(\xi_1),\ldots,\mathrm{sgn}(\xi_p))$. Then
the Chebyshev criterion is

   $w = \dfrac{|e_r|}{1 + \sum_{j=1}^{p} |\xi_j|}$   (4)

and the Chebyshev estimate is

   $\hat\theta_C = \hat\theta_J - wZ_J^{-1}\sigma = Z_J^{-1}(Y_J - w\sigma)$.   (5)
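As a check on the two methods, the following worked example applies them to a small hypothetical reference set that is not taken from the paper: the three points (x, y) = (0, 0), (1, 1), (2, 0), fitted by a straight line with intercept, so that p = 2 and n = p + 1 = 3. Both routes lead to the horizontal line y = 1/2, whose three residuals all have absolute value 1/2.

```latex
% Hypothetical data (not from the paper): rows z_i^t = (1, x_i)
Z = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}, \qquad Y = (0,\, 1,\, 0)^t .

% First method: LS fit, then (2) and (3)
\hat\theta_{LS} = (1/3,\, 0)^t, \quad
\hat e = (-1/3,\, 2/3,\, -1/3)^t, \quad
s = (-1,\, 1,\, -1)^t,
\\
w = \frac{\sum_i \hat e_i^2}{\sum_i |\hat e_i|} = \frac{2/3}{4/3} = \frac{1}{2}, \qquad
\hat\theta_C = (Z^tZ)^{-1}Z^t (Y - ws)
             = (Z^tZ)^{-1}Z^t (1/2,\, 1/2,\, 1/2)^t = (1/2,\, 0)^t .

% Second method: J = \{1, 2\}, r = 3, then (4) and (5)
\hat\theta_J = (0,\, 1)^t, \quad
e_r = 0 - 2 = -2, \quad
\xi^t = z_r^t Z_J^{-1} = (-1,\, 2), \quad
\sigma = (-1,\, 1)^t,
\\
w = \frac{|e_r|}{1 + \sum_j |\xi_j|} = \frac{2}{1 + 3} = \frac{1}{2}, \qquad
\hat\theta_C = Z_J^{-1}(Y_J - w\sigma) = Z_J^{-1} (1/2,\, 1/2)^t = (1/2,\, 0)^t .
```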


Remark 1 Although for a given reference set both methods give the same w
and $\hat\theta_C$, an implementation based on the second method is more convenient
to compute the exact LMS estimate. To our knowledge, no exact algorithm 
to compute the LMS estimator using the second method has yet appeared in 
the literature. The exact LMS algorithm of Stromberg (1993) computes the 
Chebyshev estimates of reference sets using the first method. 


Remark 2 Assume n = p + 1 and w > 0. If the matrix Z has rank p but
it does not verify the Haar condition, then there exist multiple Chebyshev
estimates (see, e.g., Cheney, 1966, p. 42, problems 6, 7). For our purposes,
the search can be restricted to those Chebyshev estimates for which all p + 1
residuals are equal in magnitude to w. Using the first method, if multiple
Chebyshev estimates exist, at least one $\hat e_i$ will be equal to zero. For these null
residuals, the respective components of s may be set equal to 1 or -1, and
using (3), multiple Chebyshev estimates are obtained. In the second method,
an elemental set J with nonsingular design matrix has to be selected from
the reference set. For this method, when multiple Chebyshev estimates exist,
some $\xi_j$ will be equal to zero. By suitably changing the values assigned to
the signs of such $\xi_j$'s, expression (5) yields multiple Chebyshev estimates.


When n > p+ 1, a naive algorithm computing the Chebyshev estimate 
examines all possible reference sets: For each reference set, it computes 
the Chebyshev criterion w through (2) or (4), and selects the reference 
set which yields the greatest w. Afterwards, it computes the Chebyshev 
estimate using (3) or (5). However, several nonexhaustive algorithms have 
been proposed in the literature for this purpose. These algorithms are based
on a linear programming formulation of the Chebyshev regression problem
that consists of

   Minimize $w$ subject to $-w \le y_i - z_i^t\theta \le w$, $i = 1,\ldots,n$.   (6)


We prefer the algorithm of Armstrong and Kung (1980) (AK) which uses 
the dual form of (6). The source code of a FORTRAN implementation of 
the AK algorithm appears in Armstrong and Kung (1979). (The published 
code contains one harmful misprint which is corrected in Agulló, 1994). 
The AK algorithm obtains a finite sequence of reference sets that converges 
to the optimal reference set. In this sequence, consecutive reference sets 
differ in only one index. In each iteration, the index to be added to the
current reference set is the one associated with the maximum absolute
residual. The index to be dropped is selected using a rule that increases
the objective function value. Notice that the AK algorithm obtains in 
each iteration a lower and an upper bound for the Chebyshev criterion. 
The lower and upper bounds are, respectively, the objective function value 
and the maximum absolute residual. When both bounds are identical, 
convergence is achieved. 


3 Exhaustive LMS algorithm 


Assume that matrix Z in model (1) verifies the Haar condition. Since the
exact LMS estimate minimizes the hth smallest absolute residual, it must
minimize the maximum absolute residual for some subset of size h of the
data (i.e., $\hat\theta_{LMS}$ is the Chebyshev estimate to some h-subset). Moreover,
from the properties of the Chebyshev regression (see Section 2), we conclude
that the exact LMS estimate is equal to the Chebyshev estimate of some
reference set. Further, the minimized hth absolute residual is the same as
the Chebyshev criterion of this optimal reference set. Consequently, in
order to compute the exact LMS estimate we can use an exhaustive algorithm
that examines all reference sets. For each reference set the algorithm
computes the Chebyshev estimate, and sets $\hat\theta_{LMS}$ as the Chebyshev estimate
with smallest hth absolute residual.

It is possible to carry out some fast preliminary tests to discard those
reference sets which cannot be optimal. The tests are based on the following
fact. Suppose we know that the optimal hth absolute residual is not greater
than $w^*$. Then a reference set (with Chebyshev criterion w and Chebyshev
estimate $\hat\theta_C$) cannot be optimal if it verifies any of the following three
impossibility conditions:

   [I]   $w > w^*$,
   [II]  $\#\{1 \le i \le n : |e_i(\hat\theta_C)| > w\} > n - h$,
   [III] $\#\{1 \le i \le n : |e_i(\hat\theta_C)| > w^*\} > n - h$,

where we use the symbol # to denote the cardinality of a set.

Stromberg's exact algorithm examines all reference sets, and computes
the Chebyshev estimate $\hat\theta_C$ for each one using (3). It uses impossibility
condition [III] to avoid the computation of all absolute residuals and/or
their sorting for some Chebyshev estimates. When a reference set does not
verify the impossibility condition [III], its Chebyshev estimate becomes
the potentially best estimate and the absolute residuals are sorted to find
the hth smallest absolute residual. This absolute residual becomes $w^*$.

We describe next an exact LMS algorithm based also on an exhaustive
enumeration of reference sets. It uses impossibility conditions [I] and [II]
to avoid unnecessary calculations. Initially the algorithm sets $w^* := \infty$ and
$R^* := \emptyset$. Then it considers all elemental sets $J = \{j_1,\ldots,j_p\} \subset \{1,\ldots,n\}$
with $1 \le j_1 < \cdots < j_p \le n - 1$. For each elemental set J, it computes the
exact fit $\hat\theta_J = Z_J^{-1}Y_J$. Then it examines the reference sets formed by adding
an index r to J, where $j_p < r \le n$. For each reference set $R = J \cup \{r\}$,
the Chebyshev criterion w is computed using (4). If R verifies impossibility
condition [I], then R cannot be optimal and it is discarded. Otherwise,
the algorithm computes the Chebyshev estimate $\hat\theta_C$ of R using (5), and
calculates the Chebyshev residuals until either a) the number of absolute
residuals greater than w equals n - h + 1, or b) the number of absolute
residuals that are not greater than w becomes equal to h. When a) occurs,
R cannot be optimal (because R verifies impossibility condition [II]) and
it is discarded. When b) occurs, the algorithm sets $w^* := w$, $R^* := R$, and
$\hat\theta^* := \hat\theta_C$. When the algorithm stops, $w^*$ yields the optimal criterion, $R^*$
the optimal reference set, and $\hat\theta^*$ the exact LMS estimate.
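The steps just described can be sketched in plain Python as follows. This is a minimal illustration under the Haar condition, not the author's implementation. The helper names (`solve`, `dot`, `sign`, `exact_lms`) are hypothetical. Linear systems are solved by simple Gaussian elimination rather than by an updated LU decomposition. A small relative tolerance is added when residuals are compared with w, since the p + 1 residuals of the reference set equal w only up to rounding. Reference sets with w equal to the current best value are also skipped, because they cannot improve it.

```python
from itertools import combinations


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(v):
    # sgn(0) is taken as +1 here (an arbitrary choice of this sketch).
    return 1.0 if v >= 0 else -1.0


def exact_lms(Z, Y, rel_tol=1e-9):
    """Exhaustive search over reference sets R = J + {r}; indices are 0-based."""
    n, p = len(Z), len(Z[0])
    h = n // 2 + (p + 1) // 2
    w_star, theta_star, R_star = float("inf"), None, None
    for J in combinations(range(n - 1), p):          # elemental sets
        ZJ = [Z[j] for j in J]
        YJ = [Y[j] for j in J]
        theta_J = solve(ZJ, YJ)                      # exact fit to J
        ZJ_t = [list(col) for col in zip(*ZJ)]
        for r in range(J[-1] + 1, n):
            e_r = Y[r] - dot(Z[r], theta_J)
            xi = solve(ZJ_t, Z[r])                   # xi^t = z_r^t ZJ^{-1}
            w = abs(e_r) / (1.0 + sum(abs(x) for x in xi))   # criterion (4)
            if w >= w_star:                          # condition [I]: discard
                continue
            sigma = [-sign(e_r) * sign(x) for x in xi]
            theta_C = solve(ZJ, [yj - w * sj for yj, sj in zip(YJ, sigma)])  # (5)
            tol = rel_tol * max(1.0, w)
            larger = not_larger = 0
            for i in range(n):                       # residuals, no sorting
                if abs(Y[i] - dot(Z[i], theta_C)) > w + tol:
                    larger += 1
                    if larger == n - h + 1:          # case a): condition [II]
                        break
                else:
                    not_larger += 1
                    if not_larger == h:              # case b): new best
                        break
            if not_larger == h:
                w_star, theta_star, R_star = w, theta_C, J + (r,)
    return w_star, theta_star, R_star
```

For data in which at least h points lie exactly on a line, the sketch returns a criterion of zero and the coefficients of that line, whatever the remaining points are.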

Matrix inversion is not needed to implement this algorithm using, for
instance, an LU decomposition of $Z_J$. Notice that the same LU decomposition
is used for evaluating, on average, n/(p + 1) reference sets and it is
only updated when the elemental set changes.

Our algorithm differs from Stromberg's in many important respects. In
Stromberg's algorithm the Chebyshev fits are based on the LS fits of
reference sets. However, in our proposal the Chebyshev fits are based on exact
fits of elemental sets. So, our algorithm uses the same exact fit to examine
several Chebyshev fits, and it can be implemented by adapting the available
LMS algorithms based on elemental sets. A further benefit of our approach
is that the computation of an approximate high breakdown multivariate
estimate for matrix Z (such as Rousseeuw's (1985) minimum volume
ellipsoid estimate) can be done at the same time as the exact LMS computation
with little additional cost (see Hawkins and Simonoff, 1993). Stromberg's
algorithm requires the computation of the Chebyshev estimators for all
reference sets. However, our algorithm uses impossibility condition [I] to avoid
the computation of the Chebyshev estimate and the Chebyshev residuals
for a significant fraction of all reference sets. Moreover, when the
computation of Chebyshev residuals has to be started, it seldom needs to
compute all residuals and it always avoids sorting the residuals completely.

The proposed algorithm examines $\binom{n}{p+1} = O(n^{p+1})$ reference sets. At its
worst, it must compute n residuals for each reference set, therefore requiring
at most $O(n^{p+2})$ time. On the other hand, Stromberg's algorithm requires
$O(n^{p+2} \log n)$ time, but if the hth smallest absolute residual is found in O(n)
time, it also requires $O(n^{p+2})$ time. Notice that although both algorithms
have the same computational complexity, this does not imply that their
true computer times are the same, since the constant factors involved can
be very different. Empirical results suggest that our proposal is about five
times faster than Stromberg's algorithm (see Section 5).


Remark 3 To analyse data sets whose design matrix has rank p and does
not necessarily verify the Haar condition, the proposed algorithm requires
two modifications. The first modification is concerned with the possibility of
a singular matrix $Z_J$. For each elemental set J, the numerical rank
of $Z_J$ should be checked. If the rank of $Z_J$ is p no problem arises. When
the rank of $Z_J$ is smaller than p - 1, the next elemental set in the list is
examined. If both matrices $Z_J$ and $Z_R$ have rank p - 1, the next reference set
in the list is examined. Finally, if the rank of $Z_J$ equals p - 1 and the rank
of $Z_R$ is p, then the algorithm selects an elemental set from R whose design
matrix has rank equal to p. Now, the complementary index of the elemental
set plays the role of r. The second modification deals with the possibility
of having some R with multiple Chebyshev estimates. The multiplicity of
Chebyshev estimates occurs when some $\xi_j$ in (4) is equal to zero. If this
occurs, we must examine all Chebyshev estimates whose residuals indexed
by R are equal in magnitude to w (see Remark 2).


4 Branch and bound algorithm 


As we noted earlier, the exact LMS estimate coincides with the Chebyshev
estimate of some h-subset. In fact, the optimal h-subset is the one with
smallest Chebyshev criterion. In this Section we propose a Branch And
Bound (BAB) algorithm to find this optimal h-subset.

Given an index set $J_m = (j_1,\ldots,j_m) \subset \{1,\ldots,n\}$ with size m, let $Z_{J_m}$
and $Y_{J_m}$ be, respectively, the submatrices of Z and Y formed by the rows
indexed by $J_m$. For the regression of $Y_{J_m}$ on $Z_{J_m}$, we denote the sum of
squared LS-residuals by $\phi(J_m)$, the sum of absolute values of LS-residuals
by $\phi'(J_m)$, and the Chebyshev criterion by $w(J_m)$. It is easy to prove that:

   $J_m \subset J_{m'} \Rightarrow w(J_m) \le w(J_{m'})$,   (7)
   $B(J_m) \le w(J_m)$,   (8)
   $\phi(J_m) > 0 \Rightarrow B(J_m) \le B'(J_m) \le w(J_m)$,   (9)

where $B(J_m) = \sqrt{\phi(J_m)/m}$ and $B'(J_m) = \phi(J_m)/\phi'(J_m)$.

Note that the monotonicity property (7) implies that the Chebyshev 
criterion cannot decrease if one or several cases are added to the current 
set of cases. 
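For completeness, one possible short argument for the two bounds is sketched below. It is not reproduced from the paper and uses only the orthogonality of the LS residuals to the columns of the design matrix together with the Cauchy-Schwarz inequality. Here $\hat e$ denotes the LS residual vector on $J_m$, $\phi = \sum_i \hat e_i^2$, $\phi' = \sum_i |\hat e_i|$, and $r(\theta)$ is introduced only for this sketch.

```latex
% For any \theta, let r(\theta) = Y_{J_m} - Z_{J_m}\theta .
% Since Z_{J_m}^t \hat e = 0, we have \hat e^t r(\theta) = \hat e^t Y_{J_m} = \hat e^t \hat e, so
\phi = \hat e^t \hat e = \hat e^t r(\theta)
     \le \Big( \sum_i |\hat e_i| \Big) \max_i |r_i(\theta)|
     = \phi' \, \max_i |r_i(\theta)| .

% Minimising the right-hand side over \theta gives, when \phi > 0,
B'(J_m) = \phi / \phi' \le w(J_m) .

% By the Cauchy-Schwarz inequality, \phi' \le \sqrt{m\,\phi}, hence
B(J_m) = \sqrt{\phi / m} \le \phi / \phi' = B'(J_m) .

% Monotonicity (7): enlarging the index set can only increase the maximum
% absolute residual of every fixed \theta, and therefore its minimum over \theta .
```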

In the search for the optimal h-subset, the BAB algorithm considers
subsets whose size is not greater than h. The current subset will be denoted
by $J_m = (j_1,\ldots,j_m)$. We start by considering sequences that verify

   $j_1 < \cdots < j_m$.   (10)

The generation of subsets is organized through a tree of nested subsets
with h node levels. In each node, a further index is added from the original
n indices. We use the tree described in Narendra and Fukunaga (1977).
Figure 1 shows a tree for n = 6 and h = 3. A node at level m is labeled
with the value of $j_m$ and represents a subset of size m. Terminal nodes
represent the $\binom{n}{h}$ possible subsets of h observations. When an exhaustive
inspection of the tree is carried out, the tree is examined by moving down 
each branch, working from right to left. When a terminal node is reached, 
the inspection continues from the most recent node that has unexplored 
branches. 

The efficiency of the BAB algorithm follows from the possibility of
jumping in the exhaustive inspection sequence of the tree. In any stage of the
search, let $w^*$ be the smallest Chebyshev criterion for an h-subset obtained
so far. If the subset being considered is $J_m$, $1 \le m \le h$, and if it verifies
that $w(J_m)$ is greater than $w^*$, then, as a consequence of the monotonicity
property, all h-subsets that contain $J_m$ can be rejected implicitly.

Figure 1: Tree for n = 6 and h = 3. Labels at nodes denote the case that
is added there.

We describe now the basic BAB algorithm. Initially it sets $w^* := \infty$ and
$J^* := \emptyset$. If the current subset $J_m$ verifies that $B(J_m)$ is greater than $w^*$,
then, from (7) and (8), the optimal h-subset cannot contain $J_m$, and the
exploration can be continued from the most recent node that has unexplored
branches. In this case, a jump occurs in the exhaustive sequence. When
the current node is terminal (i.e., m = h) and it verifies that $B(J_h)$ is not
greater than $w^*$, the bound $B'(J_h)$ is computed. If $B'(J_h) > w^*$, by (7) and
(9), $J_h$ cannot be optimal, and the exploration of the tree continues. When
$B'(J_h) \le w^*$, the iterative computation of $w(J_h)$ (using the AK algorithm)
starts. If, in some iteration of this computation, the objective function
value is greater than or equal to $w^*$ (which implies $w(J_h) \ge w^*$), then the
iterative process stops and the exploration of the tree continues. When the
AK algorithm converges (i.e., $w(J_h)$ is smaller than $w^*$), the BAB algorithm
sets $w^* := w(J_h)$ and $J^* := J_h$, and the exploration of the tree continues.
When the BAB algorithm stops, $J^*$ is the optimal h-subset and $w^*$ is the
minimal value of the LMS objective function.
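The pruning logic of the basic BAB algorithm can be illustrated on the simplest special case, the intercept-only model (p = 1), where the Chebyshev criterion of a subset is half its range and no AK iterations are needed. The sketch below is a simplified illustration under that assumption, not the algorithm as implemented by the author. The function names are hypothetical, subsets are generated as increasing index sequences in plain lexicographic order instead of the Narendra and Fukunaga tree order, and the sum of squared residuals is recomputed at each node instead of being updated through orthogonal factors.

```python
import math
from itertools import combinations


def bab_location_lms(y, h):
    """Branch and bound over h-subsets for the intercept-only model (p = 1)."""
    n = len(y)
    best = {"w": math.inf, "J": None, "nodes": 0}

    def visit(J):
        m = len(J)
        best["nodes"] += 1
        vals = [y[j] for j in J]
        mean = sum(vals) / m                           # LS fit for p = 1
        phi = sum((v - mean) ** 2 for v in vals)       # sum of squared LS residuals
        if math.sqrt(phi / m) > best["w"]:             # B(J_m) > w*: jump
            return
        if m == h:                                     # terminal node
            phi_abs = sum(abs(v - mean) for v in vals)
            if phi_abs > 0 and phi / phi_abs > best["w"]:   # B'(J_h) > w*
                return
            w = (max(vals) - min(vals)) / 2.0          # Chebyshev criterion, p = 1
            if w < best["w"]:
                best["w"], best["J"] = w, tuple(J)
            return
        # Leave enough larger indices to complete a subset of size h.
        for j in range(J[-1] + 1, n - h + m + 1):
            visit(J + [j])

    for j in range(n - h + 1):
        visit([j])
    return best["w"], best["J"], best["nodes"]


def brute_force_location_lms(y, h):
    """Reference value: smallest half-range over all h-subsets."""
    return min((max(c) - min(c)) / 2.0 for c in combinations(y, h))
```

The brute-force function is included only so that the result of the pruned search can be checked on small samples. The third return value counts the nodes actually visited.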

The computation of the bound B requires the sum of squared
LS-residuals. This computation is carried out by using an orthogonal
decomposition procedure (Gentleman, 1974) applied to $(Z_{J_m}, Y_{J_m})$. When a case
is added to the current subset, the orthogonal factors are updated. When
the algorithm operates descending by a branch of the tree, the orthogonal
factors of each level are saved. In this way, the algorithm can reselect
the adequate factors when it returns to a smaller level node. This permits
updating the orthogonal factors when the inspection continues by an
unexplored branch. When a terminal node is reached and it verifies $B(J_h) \le w^*$,
then the computation of the bound $B'(J_h)$ is required. This computation
is carried out quickly from the orthogonal factors. If $B'(J_h) > w^*$, the
computation of $w(J_h)$ is avoided, because $J_h$ is not optimal.

We describe now two strategies to improve the computational efficiency
of the basic BAB algorithm. Suppose that m < h, rank($Z_{J_m}$) = p, and
$0 < B(J_m) \le w^*$. Since the orthogonal decomposition of $(Z_{J_m}, Y_{J_m})$ is available,
the bound $B'(J_m)$ can be quickly computed. If $B'(J_m) > w^*$, then, by (7)
and (9), a jump is justified. Notice that if m = p + 1, then $w(J_m) = B'(J_m)$,
by (2), whereas if m > p + 1, then $w(J_m) \ge B'(J_m)$. If m > p + 1
and $B'(J_m) < w(J_m)$, then the iterative AK algorithm to compute $w(J_m)$
can be started and continued until either a) the current objective function
value is greater than or equal to $w^*$, or b) the current greatest absolute
residual is smaller than $w^*$. From the properties of the AK algorithm (see
Section 2), when a) occurs, we conclude that $w(J_m)$ is greater than $w^*$,
and consequently, a jump in the exhaustive sequence is justified. However,
when b) occurs, $w(J_m)$ is smaller than $w^*$, and a jump cannot be justified.

In a node j_m at level m with ancestor nodes j_1, ..., j_{m-1}, only
certain indices can be selected as successors in level m + 1. All indices
selected at nodes with labels j_1, ..., j_m and all indices that appear in
the "brother" nodes to the left of nodes j_1, ..., j_m are not available.
Assume that the current node has n_s successors at level m + 1. The basic
BAB algorithm selects the first n_s available indices and assigns them to
the successors in order from left to right in the tree. If we do not
impose restriction (10) in the subset generation process, we can use a
sorting rule for assigning available indices to the successors. Suppose
that the current design matrix Z_{J_m} has rank p. For each available
index i we can compute the increase in the sum of squared LS-residuals
caused by adding the index i to J_m. This increase, which we call the ith
partial increment, is


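\[
\gamma_i = \frac{\bigl(y_i - z_i^{T}(Z_{J_m}^{T} Z_{J_m})^{-1} Z_{J_m}^{T} Y_{J_m}\bigr)^{2}}
                {1 + z_i^{T}(Z_{J_m}^{T} Z_{J_m})^{-1} z_i}.
\]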

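The partial increment can be checked numerically. The following is a minimal sketch, not the authors' FORTRAN code: it solves the normal equations by Gaussian elimination instead of updating orthogonal factors, assumes the design matrix has full column rank, and uses hypothetical names (`solve`, `rss`, `partial_increment`). It illustrates that the closed-form increment equals the growth of the residual sum of squares when one case is added.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def normal_equations(z: Matrix, y: Vector) -> Tuple[Matrix, Vector]:
    """Return Z'Z and Z'Y for the rows of z and responses y."""
    p = len(z[0])
    ztz = [[sum(r[i] * r[j] for r in z) for j in range(p)] for i in range(p)]
    zty = [sum(r[i] * yi for r, yi in zip(z, y)) for i in range(p)]
    return ztz, zty


def rss(z: Matrix, y: Vector) -> float:
    """Residual sum of squares of the least squares fit."""
    ztz, zty = normal_equations(z, y)
    beta = solve(ztz, zty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(z, y))


def partial_increment(z: Matrix, y: Vector, z_new: Vector, y_new: float) -> float:
    """Increase of the RSS caused by adding the case (z_new, y_new)."""
    ztz, zty = normal_equations(z, y)
    beta = solve(ztz, zty)
    w = solve(ztz, z_new)  # (Z'Z)^{-1} z_new
    resid = y_new - sum(b * v for b, v in zip(beta, z_new))
    leverage = sum(a * b for a, b in zip(z_new, w))
    return resid ** 2 / (1.0 + leverage)
```

For a current subset with rows `Z` and responses `Y`, `partial_increment(Z, Y, z_i, y_i)` agrees, up to rounding, with `rss(Z + [z_i], Y + [y_i]) - rss(Z, Y)`, so candidate indices can be ranked without refitting.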

The sorting rule selects the n_s available indices with greatest partial
increments and assigns them to the successors according to the decreasing
magnitude of the partial increment from left to right in the tree. If the
BAB algorithm uses that sorting rule and the current subset J_m verifies
that B(J_m) is greater than w*, then the subtree emanating from the
current node does not have to be examined. This follows from (7) and (8).
Furthermore, if the design matrix of the previous level has rank p, as a
consequence of the sorting rule all brother nodes to the left of the
current node and the subtrees descending from such nodes can also be
discarded.

Remark 4 Denote γ* = (m + 1)(w*)² − d(J_m), and let γ' = γ_{n_s:n_a} be
the n_s-th greatest partial increment from the n_a available indices. If
γ' is greater than γ*, it is not necessary to assign indices to
successors, because the optimal h-subset cannot contain the current subset
J_m. Sometimes the computation of partial increments for all available
indices and their partial sorting can be avoided. Suppose that we proceed
to compute the partial increments and maintain simultaneously a counter,
n_γ, to determine the number of partial increments that are not smaller
than γ*. If, at some stage of this process, n_γ equals n_s, then γ' will
necessarily be greater than γ*, and a jump in the exhaustive sequence is
justified.


Even if the tree is enumerated through the sorting rule, the efficiency of 
the BAB algorithm depends on the initial ranking of cases in the data set. 
Notice that for nodes at low levels, the sorting rule assigns the first available 
indices to the successors. The empirical tests we have carried out suggest 
the convenience of reassigning the indices 1,...,n to the cases according 
to the decreasing magnitude of residuals based on an approximate LMS 
estimate. In this way, the first h-subset evaluated by the BAB algorithm 
gives the approximate LMS estimate and a good upper bound for the opti- 
mal value of the objective function is found quickly. The approximate LMS 
estimate is computed by the BAB algorithm before starting to inspect the 
tree. For this task, we use a modification of the Feasible Subset Algorithm 
(FSA) described in Hawkins (1993). 


5 Empirical comparison of the LMS algorithms 


To compare the efficiency of the LMS algorithms, we have considered the 
following FORTRAN implementations: 

• MVELMS, code of Hawkins and Simonoff (1993). We use the option that
examines all elemental sets and adjusts the intercept for each elemental
set.

• EXTLMS, implementation of Stromberg's exact algorithm due to Hawkins,
Simonoff and Stromberg (1994).

• LULMS, our implementation of the algorithm described in Section 3. It
uses an LU decomposition of the basis as described by Bartels and Golub
(1969) to obtain solutions of square systems. The code used implements
the first modification described in Remark 3, but not the second one.

• MVELMS1, our modification of the MVELMS code. This modification is
based on the algorithm explained in Section 3, and omits from the search
those elemental sets whose design matrix is singular.

• BABLMS, our implementation of the BAB algorithm described in Section 4.
It obtains the exact LMS estimate without exhaustive enumeration of
h-subsets.


Table 1: Comparison of CPU times (in seconds) for LMS algorithms.

Data sets: Aircraft, Coleman, Delivery, Educat, Hawkins, Races, Salinity,
Stacklos, Wood.

DATA SET | n | p | h |  EXTLMS | MVELMS1 |   LULMS | MVELMS | BABLMS
         |   |   |   |   42.69 |   13.30 |    8.30 |   8.35 |   0.66
         |   |   |   |   43.13 |    9.95 |    7.09 |  10.44 |   0.32
         |   |   |   |    2.97 |    0.88 |    0.50 |   0.44 |   0.11
         |   |   |   |  804.18 |  343.08 |  159.01 | 102.31 |  10.82
         |   |   |   | 7774.73 | 4823.85 | 1905.11 | 834.07 | 531.31
         |   |   |   |   13.46 |    4.01 |    2.25 |   1.75 |   0.27
         |   |   |   |   31.92 |    8.85 |    5.38 |   9.16 |   0.43
         |   |   |   |    6.48 |    1.65 |    1.21 |   1.15 |   0.11
         |   |   |   |   43.02 |    9.78 |    7.09 |  10.33 |   0.38
         |   |   |   | 8762.58 | 0215.35 | 2095.94 | 974.00 | 544.90


Of these algorithms, MVELMS gives only an approximate LMS estimate,
whereas the remaining algorithms are exact. EXTLMS, LULMS and MVELMS1
guarantee an exact LMS estimate when the observations are in general
position, and BABLMS obtains a true LMS estimate when the design matrix
has rank p, which is a much weaker condition. In our implementations, we
use subroutines of Armstrong and Kung (1979), Miller (1992), Miller and
Nguyen (1994), Ridout (1988), and Wichmann and Hill (1982). All source
codes were compiled with FTN77 for 486 and were run on a 90 MHz Pentium
computer. The central processor unit (CPU) times for several data sets
are shown in Table 1. All data sets, excluding the one labeled "Races",
can be found in Rousseeuw and Leroy (1987). The Races data set is from
Atkinson (1986). Figure 2 shows the multiple boxplot of relative
efficiencies, measured using the ratio of CPU time to EXTLMS CPU time.

Table 1 and Figure 2 show that for the data sets used the BABLMS
algorithm dominates all other algorithms. On average, the BABLMS
algorithm is about 65 times faster than the EXTLMS one, and at least one
order of magnitude faster than the exhaustive MVELMS algorithm, which is
only approximate. This may be surprising, because BABLMS searches a
subset within a collection of size \binom{n}{h}, and this number will, in
general, be much greater than, for instance, \binom{n}{p+1}, which is the
number of subsets inspected by EXTLMS. The BABLMS algorithm is more
efficient because the number of subsets explicitly visited by it is
usually smaller than the number of reference sets.

Notice that the exact LULMS algorithm is about five times faster than
the EXTLMS algorithm, and it does not need much more computer time than
the approximate MVELMS algorithm based on exhaustive search over all
elemental sets.


Figure 2: Multiple boxplots of relative efficiency of LMS algorithms. The 
baseline corresponds to the EXTLMS algorithm. 


[Figure 2: two panels of boxplots. Vertical axis: RELATIVE EFFICIENCY.
Groups: BABLMS, MVELMS, LULMS, MVELMS1 (first panel); MVELMS, LULMS,
MVELMS1 (second panel).]


To sum up, we can conclude that for small or moderate sample sizes the
proposed algorithms are more efficient than the other finite exact
algorithms proposed in the literature.


Abstract: In this paper we consider the following problem. Let X =
{x_1, ..., x_n} be a set of observation points endowed with a partial
order ⪯, and let y_1, ..., y_n be the values of the dependent variable y.
We are searching for an isotonic function f : X → R (i.e. x_i ⪯ x_j
implies that f(x_i) ≤ f(x_j)) that minimizes the l_p-error.
We recall some general algorithms for solving this and related regression
problems and we present new polynomial algorithms for some versions of
the isotonic regression problem.


1 Introduction


The basic isotonic regression problem can be formulated as follows: given
values y_1, ..., y_n of the dependent variable y, corresponding to values
x_1, ..., x_n of the independent variable x, which constitute a set X with
a partial order ⪯ (i.e., a reflexive, transitive and antisymmetric binary
relation on X), fit to the y_i a best function y = f(x) which is
non-decreasing (alias isotonic) with respect to ⪯. The error norm usually
chosen is l_2: we are seeking an isotonic function f on X that minimizes


\[
D_2(f) = \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^{2}.
\]


Algorithms for this problem have received a great deal of attention and
a collection of them has been discussed in detail in [3, 11, 25]. In the
case when ⪯ is a total order on X all of the algorithms work in linear
time O(n) provided ⪯ is given. In particular we mention the simple
Pool-Adjacent-Violators algorithm introduced by Ayer et al. [1] and
popularized by Kruskal [18] under the name Up-and-Down Blocks algorithm.
It has been extended to rooted trees by Thompson [31] with the Minimum
Violator algorithm. The Pool-Adjacent-Violators algorithm is also
implicit in van Eeden [9], who extended in [10] the procedures to
regressions bounded by two given functions.
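For a total order, the Pool-Adjacent-Violators idea can be sketched in a few lines. This is an illustrative version only, assuming one observation per point and optional positive weights; the name `pava` and its parameters are hypothetical and do not come from the cited papers. Adjacent blocks whose means violate monotonicity are pooled and replaced by their weighted mean.

```python
from typing import List, Optional


def pava(y: List[float], w: Optional[List[float]] = None) -> List[float]:
    """Least squares isotonic fit to y_1, ..., y_n under a total order."""
    blocks = []  # each block: [weighted sum, total weight, number of points]
    for i, yi in enumerate(y):
        wi = 1.0 if w is None else w[i]
        blocks.append([wi * yi, wi, 1])
        # Pool while the previous block mean exceeds the last block mean.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, ww, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += ww
            blocks[-1][2] += c
    fit = []
    for s, ww, c in blocks:
        fit.extend([s / ww] * c)
    return fit
```

For example, `pava([1, 3, 2, 4])` pools the violating pair (3, 2) into a block with mean 2.5. Each point is pushed once and each pooling removes a block, which is why the number of block operations is linear in n.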

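The situation becomes much more complex if ⪯ is a partial order, say
x_1, ..., x_n are points in the d-dimensional space and x_i ⪯ x_j if and
only if

\[
x_i = (x_i^{(1)}, \ldots, x_i^{(d)}), \quad
x_j = (x_j^{(1)}, \ldots, x_j^{(d)}) \quad \text{and} \quad
x_i^{(k)} \le x_j^{(k)} \ \text{for each } 1 \le k \le d.
\]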


Although some algorithms described in [3] are applicable in the general
case, their computational complexity is already exponential; see [8]. A
convergent numerical algorithm for X ⊂ R² has been proposed in [8].
Geometrically, one can formulate the l_2-regression problem as the
computation of the projection of the vector y = (y_1, ..., y_n) onto the
convex cone K of the isotonic functions on X. Since K is defined by a
finite number of constraints, it is polyhedral. Therefore, one can use
any algorithm for solving a quadratic optimization problem, especially
those established for projecting onto polyhedral cones, as, for example,
that presented in [19]. The alternative procedure of Dykstra [7] is also
efficient for polyhedral cones. However, several specific algorithms have
been developed for regression problems. The reader will find a vast
literature on this topic.

Among other criteria, the choice of l_2 as an error norm is due to the
connection with special estimates. Usually, the estimates studied in
order restricted statistical inference can be expressed by compact
"max-min" formulas (of course, although they are a useful tool in
consistency proofs, these formulas are not very appropriate for computing
the estimates). For instance, let M(A) be the mean of a collection A of
observations taken from a poset X. Now, a subset L of X is called an
upper layer if x_i ∈ L and x_i ⪯ x_j imply that x_j ∈ L. Then


\[
f(x_i) = \max_{L :\, x_i \in L} \ \min_{L' :\, x_i \notin L'} M(L - L')
\]


Polynomial algorithms for isotonic regression 149 


is an isotonic function and could be used as an estimate. It is shown
in [3] that f provides the best l_2-approximation of y_1, ..., y_n and can
be calculated by a special minimum lower sets algorithm introduced by
Brunk et al. [5] (the complexity of the latter is exponential). A deep
generalization of this result to all Cauchy mean value functions has been
obtained in [28] (a function M defined on the nonempty subsets of X is
said to be a Cauchy mean value function [27, 28] if M(A ∪ B) belongs to
the segment [M(A), M(B)] whenever A and B are nonempty and disjoint
subsets of X). A description of all linear Cauchy mean value functions
has been given in [21]. The result from [28] asserts that if M is such a
function and the error measure D(f) verifies three rather natural
conditions, then f given by the abovementioned "max-min" formula
minimizes D(f) subject to the restriction that f is isotonic on X.
Moreover, f can be computed using a refined version of the minimum lower
set algorithm. As is noticed in [28], this result includes all
l_p-regression problems (1 ≤ p ≤ ∞) in their most general form as special
cases (however, the modal regression problem [29] does not fit in this
framework).

Namely, assume that with each element x_i of a poset (X, ⪯) is associated
a set of numbers y_{i1}, ..., y_{i r_i}, corresponding, for example, to a
sample from the ith distribution. We are looking for an isotonic function
f on X (i.e., x_i ⪯ x_j implies f(x_i) ≤ f(x_j)) that minimizes

\[
D_p(f) = \Bigl[\sum_{i=1}^{n} \sum_{l=1}^{r_i} |y_{il} - f(x_i)|^{p}\Bigr]^{1/p}.
\]

If p = 1 we obtain the isotonic median regression problem that
corresponds to the l_1-error norm:

\[
D_1(f) = \sum_{i=1}^{n} \sum_{l=1}^{r_i} |y_{il} - f(x_i)|.
\]

For X ⊂ R this problem has been investigated in [26, 24] (unfortunately,
the algorithm presented in [24] contains a serious gap, since its two
steps do not cover all possible cases). The minimum lower set algorithm
from [28], acting on the collection of upper layers, has exponential
complexity whenever this collection is of exponential cardinality. This
is already the case for rather simple partial orders such as rooted trees
or series-parallel partial orders. To our knowledge there are no
polynomial time algorithms to find a best isotonic l_p-regression
function (p < ∞) in the general case of a partial order, or even in the
case when X ⊂ R^d, d ≥ 2. It would be interesting and important to know
which instances of this problem are NP-complete.


2 Isotonic l_p-regression problem for rooted trees


This section is devoted to the isotonic l_p-regression problem
(1 ≤ p ≤ ∞) for partial orders whose covering graphs are rooted trees. We
propose a simple extension of the maximum (minimum) violator algorithm,
originally established for the l_2-criterion. Let us consider a rooted
tree, the vertices of which are the elements x_1, ..., x_n of a set X.
The tree is here oriented from the leaves to the root. To every vertex
x_i, a sample y_{i1}, ..., y_{i r_i} is preassigned. We consider the
l_p-regression problem defined in the introduction: minimize D_p(f)
subject to the constraint that f is isotonic. The y_{il} define a vector
of R^r, where r = Σ_{i=1}^{n} r_i. Denoting θ_i = f(x_i) and duplicating
each θ_i r_i times, we get a vector θ ∈ R^r, and the minimization problem
may be written as: minimize ||y − θ||_p, with equality and inequality
constraints on θ. Such a problem corresponds to the projection onto a
closed polyhedral cone with respect to the l_p norm. This guarantees the
existence of a solution.

For every sample of real numbers A = {u_1, ..., u_k}, let us consider
the problem: min_x [Σ_{j=1}^{k} |x − u_j|^p]^{1/p}. It is well-known that
such a problem admits a solution. This is the mean for p = 2, the
midrange for p = ∞ and a median point for p = 1. Every solution lies
between min u_j and max u_j, and due to the convexity, the set of
solutions is a closed interval. We denote it by M(A) = [a, b]. Moreover,
it is easy to see that [Σ_{j=1}^{k} |x − u_j|^p]^{1/p} is strictly
decreasing when x varies from −∞ to a, and is strictly increasing when x
varies from b to +∞. Again, given two samples A and A', we denote by
A + A' their amalgamation. Let M(A) = [a, b], M(A') = [a', b'],
M(A + A') = [c, d]. Then, clearly M(A + A') = M(A) ∩ M(A') provided
M(A) ∩ M(A') ≠ ∅. Moreover, it is easy to prove that M obeys the Cauchy
mean condition: if M(A) ∩ M(A') = ∅ with, for instance, b < a', then
b ≤ c and d ≤ a'.
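The interval M(A) is easy to compute in the three cases named above. The sketch below is only an illustration, with the hypothetical helper name `center_interval`; it covers p = 1, p = 2 and p = ∞ and leaves other values of p out.

```python
from typing import List, Tuple


def center_interval(sample: List[float], p: float) -> Tuple[float, float]:
    """Return M(A) = [a, b], the set of minimizers of the l_p deviation."""
    u = sorted(sample)
    k = len(u)
    if p == 1:
        # All points between the lower and the upper median are optimal.
        return (u[(k - 1) // 2], u[k // 2])
    if p == 2:
        mean = sum(u) / k
        return (mean, mean)
    if p == float("inf"):
        midrange = (u[0] + u[-1]) / 2
        return (midrange, midrange)
    raise ValueError("only p = 1, 2 and infinity are covered by this sketch")
```

With A = {1, 2, 4, 7} and A' = {3, 3} the l_1 intervals are [2, 4] and [3, 3]; they intersect, and the amalgamated sample {1, 2, 3, 3, 4, 7} indeed has M(A + A') = [3, 3].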

Now we describe our algorithm. At a current step, we have a rooted
tree with a partition of X, say V, as a set of vertices. To every vertex
v_i of V, a sample A_i is assigned, where A_i stands for the amalgamation
of the samples {y_{jl}, l = 1, ..., r_j} for all j such that x_j ∈ v_i.
Initially, V is the finest partition of X. For every v_i ∈ V, denote
M(A_i) = [a_i, b_i]. If for every edge (v_i, v_j) of the current tree
with v_i ≺ v_j we have a_i ≤ a_j, then put f(x_k) = a_i for every
x_k ∈ v_i and stop. Otherwise, find v_0 ∈ V with predecessors
v_1, ..., v_m obeying the following conditions

(i) a_0 ≤ a_max := max_{i=1,...,m} a_i;

(ii) for all i = 1, ..., m, if v_j ≺ v_i then a_j ≤ a_0.

Then aggregate v_0 and all v_i such that a_i = a_max, and amalgamate
the corresponding samples A_0 and A_i. We get a new partition of X with
v_0 := v_0 ∪ {v_i : a_i = a_max} and all other subsets of V, and a new
rooted tree, to which we apply the same procedure.


Proposition 1 The function f obtained by this algorithm minimizes the
l_p-criterion function D_p(f) subject to the restriction that f is
isotonic. The function f can be determined in a total of O(nr)
operations.


Proof: The current step corresponds to the new isotonic l_p-regression
problem:

(P): minimize [Σ_{v_i ∈ V} Σ_{y ∈ A_i} |θ_i − y|^p]^{1/p} with the
constraints: v_i ⪯ v_j implies θ_i ≤ θ_j.

We prove by induction that every solution θ̂ of the reduced problem
furnishes a solution f of the initial problem by letting f(x_k) = θ̂_i
provided x_k ∈ v_i. Clearly, that is true at the initial step. Suppose
that at a current step, there is v_0 obeying the conditions of the
algorithm.

First, we show that there is a solution θ̂ of (P) verifying θ̂_0 = θ̂_j
for some j ∈ {1, ..., m}. Suppose there is a solution θ* of (P)
verifying: θ*_0 > θ*_i for all i = 1, ..., m. Then
a_i ≤ θ*_i < θ*_0 ≤ b_0 for all i = 1, ..., m. Indeed, if for some i,
θ*_i < a_i, defining θ' by θ'_i := θ*_i + δ and θ' := θ* otherwise, θ'
should satisfy the isotony constraints and should reduce the value of
the criterion for δ > 0 sufficiently small. Similarly, if θ*_0 > b_0,
defining θ'' by θ''_0 := θ*_0 − δ and θ'' := θ* otherwise, θ'' should
satisfy the constraints and should reduce the value of the criterion for
δ > 0 sufficiently small, establishing our assertion.

Let θ*_j realize max{θ*_i : i = 1, ..., m}. Then
a_0 ≤ max a_i ≤ θ*_j < θ*_0 ≤ b_0. Define θ̂ by θ̂_0 := θ*_j and
θ̂ := θ* otherwise. Then clearly θ̂ satisfies the isotony constraints and
preserves the value of the criterion. Thus, θ̂ is a solution of a problem
obtained from (P) by adding another constraint θ_0 = θ_j, which is
clearly equivalent to a reduced problem of (P) obtained by aggregating
v_0 and v_j, by amalgamating A_0 and A_j to A'_0 and by joining the
predecessors of v_j to v'_0 := v_0 ∪ v_j. The hypotheses show that
a'_0 ≤ a_max, where M(A'_0) = [a'_0, b'_0]. By induction and the
previous assertion one can deduce that there is a solution θ̂ of (P)
verifying θ̂_0 = θ̂_i for every i such that a_i = a_max. Since every
solution of the reduced problem defined by the algorithm is a solution
of (P) satisfying those new constraints, the induction is proved.
Moreover, at the end of the algorithm, θ̂ defined by θ̂_i := a_i is
clearly a solution of (P). □

The condition required in the algorithm for pooling two adjacent vertices
indicates that we must start from leaves and proceed in an up-and-down
way. However, when we have a total order, the proof shows that we may
aggregate any pair of consecutive vertices violating the isotony
condition, as in the Pool-Adjacent-Violators algorithm.

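For a total order and p = 1, the remark above suggests pooling consecutive violators and using the lower endpoint a of the median interval M(A) as the block value. The following sketch makes those assumptions explicit: one observation per point, the hypothetical function name `isotonic_l1`, and `statistics.median_low` as the lower median. It recomputes medians naively, so it does not reflect the operation count stated in Proposition 1.

```python
from statistics import median_low
from typing import List


def isotonic_l1(y: List[float]) -> List[float]:
    """Isotonic median regression on a total order by pooling violators."""
    blocks: List[List[float]] = []  # each block holds its pooled observations
    for value in y:
        blocks.append([value])
        # Pool while the lower medians of the last two blocks violate isotony.
        while (len(blocks) > 1 and
               median_low(blocks[-2]) > median_low(blocks[-1])):
            last = blocks.pop()
            blocks[-1].extend(last)
    fit: List[float] = []
    for block in blocks:
        fit.extend([median_low(block)] * len(block))
    return fit
```

For y = (3, 1, 2, 5, 4) the pairs (3, 1) and (5, 4) are pooled, giving the fit (1, 1, 2, 4, 4) with l_1-error 3. No isotonic fit can do better, since the first pair alone costs at least 2 and the second at least 1.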

3 Isotonic l_∞-regression problem and its variants


In this section we consider the isotonic regression problem with the
l_∞-error norm (i.e. the well-known uniform or Chebychev measure of
error); for algorithmic approaches to similar approximation problems see
[4, 15, 14] and the references there. For this problem we present a
strikingly simple optimal estimate which can be computed in time
proportional to the size of the covering graph of the poset (X, ⪯). If
X ⊂ R^d and |X| = n, the computational complexity of computing this
estimate is O(dn²). The result is due to Ubhaya [32, 33] and was
rediscovered by one of the authors of this note. We also consider the
p-isotonic regression problem with the Chebychev norm and show how to
reduce it to a graph-theoretical problem. In the particular case p = 2
this allows us to present a polynomial algorithm.


3.1 Isotonic l_∞-regression problem


Let X = {x_1, ..., x_n} be a set of observation points endowed with a
partial order ⪯ and let y_1, ..., y_n be the corresponding values of the
dependent variable y. The simplest (and the most economical) way to
present a partial order is to define its covering graph G = (X, E): in G
two elements x_i and x_j are joined by an arc if x_i ≺ x_j (i.e.
x_i ⪯ x_j and x_i ≠ x_j) and there is no other element x_k such that
x_i ≺ x_k ≺ x_j. The goal of the isotonic regression problem with the
l_∞-norm is to determine an isotonic function f : X → R that minimizes
the l_∞-error

\[
D_\infty(f) = \max_{x_i \in X} |y_i - f(x_i)|
\]

(in [12] the history of using this and other criteria in estimation
procedures has been discussed).
For an element x_i ∈ X consider the order ideals

L_⪯(x_i) = {x_j ∈ X : x_j ⪯ x_i},   L_⪰(x_i) = {x_j ∈ X : x_i ⪯ x_j}.

Let

f^*(x_i) = max{y_j : x_j ∈ L_⪯(x_i)}

and

f_*(x_i) = min{y_j : x_j ∈ L_⪰(x_i)}.

The reflexivity of ⪯ implies that L_⪯(x_i) ∩ L_⪰(x_i) = {x_i}. In
particular, f^*(x_i) ≥ f_*(x_i).


Proposition 2 The function f(x_i) = ½(f^*(x_i) + f_*(x_i)) minimizes
D_∞(f) subject to the restriction that f is isotonic. Given the covering
graph G, the values of f can be computed in total time O(|E|). If
X ⊂ R^d, then the computation of f can be performed in O(dn²) time.


Proof: Let (x_i, x_j) be an arbitrary edge of G, and assume x_i ≺ x_j.
Since L_⪯(x_i) ⊂ L_⪯(x_j) and L_⪰(x_j) ⊂ L_⪰(x_i), we conclude that
f^*(x_j) ≥ f^*(x_i) and f_*(x_j) ≥ f_*(x_i), yielding that f is isotonic
on X.

Suppose by way of contradiction that D_∞(g) < D_∞(f) for an isotonic
function g. Let ε = D_∞(f) and consider an element x_i such that
ε = |f(x_i) − y_i|. Let f^*(x_i) = y_j and f_*(x_i) = y_k for elements
x_j ∈ L_⪯(x_i) and x_k ∈ L_⪰(x_i). Suppose without loss of generality
that y_i belongs to the segment [y_k, f(x_i)]. Since |g(x_i) − y_i| < ε
and g(x_j) ≤ g(x_i) we immediately obtain that g(x_j) < f(x_i). But then
|g(x_j) − y_j| > y_j − f(x_i) ≥ ε, contrary to the choice of g.

The values of f^* can be computed recursively, starting from the minimal
elements of X. If we know f^* for all predecessors of x_i, then f^*(x_i)
is the maximum among y_i and max{f^*(x_j) : (x_j, x_i) ∈ E}. Analogously,
the values of f_* can be computed recursively starting from the maximal
elements of X. If f_* is computed for all successors of x_i, then
f_*(x_i) is the minimum among y_i and min{f_*(x_j) : (x_i, x_j) ∈ E}.
Evidently, this can be done in O(|E|) time. If X ⊂ R^d, then the covering
graph of the resulting poset can be computed in O(dn²) time. □
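The two recursions in the proof translate directly into two sweeps over a topological order of the covering graph. The sketch below is a hedged illustration: the function name `isotonic_linf`, the edge-list representation (a pair (i, j) meaning that x_j covers x_i) and the use of Kahn's algorithm for the topological order are assumptions of this example, not part of the original text.

```python
from collections import deque
from typing import List, Tuple


def isotonic_linf(y: List[float], edges: List[Tuple[int, int]]) -> List[float]:
    """Return f = (f^* + f_*) / 2 on a poset given by its covering graph."""
    n = len(y)
    succ: List[List[int]] = [[] for _ in range(n)]
    indeg = [0] * n
    for i, j in edges:  # arc from x_i up to x_j
        succ[i].append(j)
        indeg[j] += 1
    # Topological order, starting from the minimal elements.
    order = []
    queue = deque(i for i in range(n) if indeg[i] == 0)
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in succ[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                queue.append(j)
    upper = list(y)  # f^*: largest y-value among the elements below
    for i in order:
        for j in succ[i]:
            upper[j] = max(upper[j], upper[i])
    lower = list(y)  # f_*: smallest y-value among the elements above
    for i in reversed(order):
        for j in succ[i]:
            lower[i] = min(lower[i], lower[j])
    return [(u + l) / 2 for u, l in zip(upper, lower)]
```

On the chain x_0 ≺ x_1 ≺ x_2 with y = (3, 1, 2) this gives f = (2, 2, 2.5) with maximal error 1, which cannot be improved because the pair (3, 1) alone forces an error of at least 1. Every arc is visited a constant number of times, in line with the O(|E|) bound of Proposition 2.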

In Figure 1 we present an example of application of Proposition 2 (the
optimal error is ε* = 4).


FIGURE 1. 


With little effort one can present an optimal estimate for the general
isotonic l_∞-regression problem. Assume as before that with each element
x_i of (X, ⪯) is associated a set of (distinct) numbers
y_{i1}, ..., y_{i r_i}, and we wish to find an isotonic function f on X
that minimizes

\[
D_\infty(f) = \max_{x_i \in X} \ \max_{l = 1, \ldots, r_i} |y_{il} - f(x_i)|.
\]

For an element x_i ∈ X set

\[
f^{*}(x_i) = \max_{x_j \in L_{\preceq}(x_i)} \ \max_{l = 1, \ldots, r_j} y_{jl}
\]

and

\[
f_{*}(x_i) = \min_{x_j \in L_{\succeq}(x_i)} \ \min_{l = 1, \ldots, r_j} y_{jl}.
\]

We assert that the function f(x_i) = ½(f^*(x_i) + f_*(x_i)) minimizes
D_∞(f) subject to the restriction that f is isotonic. For this we extend
the partial order ⪯ from X to the multiset
{y_{il} : i = 1, ..., n, l = 1, ..., r_i}: set y_{il} ⪯ y_{jt} if and
only if x_i ≺ x_j, or i = j and y_{il} ≥ y_{it}. Let f(y_{il}) be the
function defined as in Proposition 2. One can easily note that
f(y_{il}) = f(x_i) for all y_{il} (l = 1, ..., r_i). From this and
Proposition 2 we deduce that f(x_i) minimizes D_∞(f). The values of f can
be computed in O(|E| + Σ_{i=1}^{n} r_i) operations.


3.2 p-Isotonic l_∞-regression problem


In some recent papers [22, 2, 20] new generalizations of the classical
linear regression problem have been given. For example, [20] presents an
efficient algorithm for partitioning a planar set
S = {s_1 = (x_1, y_1), ..., s_n = (x_n, y_n)} into two parts S_1 and S_2
such that

\[
\sum_{s_i \in S_1} \bigl(y_i - f_1(x_i)\bigr)^{2} + \sum_{s_i \in S_2} \bigl(y_i - f_2(x_i)\bigr)^{2}
\]

is minimized, where f_1 and f_2 are the regression lines of the sets S_1
and S_2 (the multidimensional case is treated in [16]). Agarwal and
Sharir [2] presented an algorithm with complexity O(n² log⁵ n) for
solving a similar problem, replacing the l_2-criterion function by the
l_∞-error function. Namely, they are searching for a bipartition of a
planar set S such that the maximum width of the two parts is as small as
possible. Recall that the width of a set is the smallest distance between
a pair of parallel supporting lines. Equivalently, it is necessary to
find two linear functions f_1 and f_2 such that
max_{s_i ∈ S} min_{j=1,2} |y_i − f_j(x_i)| is minimized (one can
formulate this problem for p-partitions as is done in [22]). For isotonic
regressions, this leads us to the following general formulation.

As before let X = {x_1, ..., x_n} be a set with a partial order ⪯, and
let y_1, ..., y_n be the corresponding values of the variable y. We wish
to find a partition X_1, ..., X_p of X and the isotonic functions
f_1, ..., f_p on X_1, ..., X_p, respectively, such that the l_∞-error

\[
D_\infty(f_1, \ldots, f_p) = \max\Bigl\{\max_{x_i \in X_1} |y_i - f_1(x_i)|, \ldots, \max_{x_i \in X_p} |y_i - f_p(x_i)|\Bigr\}
\]

is minimized. Below we will show how to reduce this problem to a special
graph-theoretic problem and how to solve it efficiently for p = 2. Define
a symmetric matrix D = (d_ij), where d_ij = ½(y_i − y_j) if x_i ⪯ x_j and
y_i ≥ y_j, and d_ij = 0 otherwise. Let ε* be the minimum of the function
D_∞(f_1, ..., f_p). From Proposition 2 we obtain the following result.


Lemma 1 ε* is an element of the matrix D.


To find an optimal partition of X with respect to the criterion function
D_∞(f_1, ..., f_p) we proceed as follows (the idea is borrowed from the
methods of solving center location problems; see for example [17, 30]).
We sort the elements of the matrix D in increasing order and search the
obtained list for the minimum value which is feasible in the following
sense. A value ε is feasible if there is a p-partition X_1, ..., X_p of X
and the isotonic functions f_1, ..., f_p on X_1, ..., X_p, respectively,
such that D_∞(f_1, ..., f_p) ≤ ε. To decide if a value ε ∈ D is feasible
we define a new graph Γ_ε. The vertices of Γ_ε are the elements of X, and
two vertices x_i and x_j are adjacent in Γ_ε if and only if either x_i
and x_j are incomparable, or x_i ⪯ x_j are comparable and
y_i − y_j ≤ 2ε. A clique of Γ_ε is a subset of pairwise adjacent
vertices.


Lemma 2 ε is a feasible value if and only if the vertices of Γ_ε can be
covered with at most p cliques.


Proof: First, assume that D_∞(f_1, ..., f_p) ≤ ε for isotonic functions
f_1, ..., f_p defined on classes X_1, ..., X_p of a partition of X. We
assert that each X_k is a clique of the graph Γ_ε. Assume the contrary,
i.e. y_i − y_j > 2ε for some x_i, x_j ∈ X_k, x_i ≺ x_j. We can suppose
without loss of generality that y_j is the smallest value in
{y_s : x_s ∈ L_⪰(x_i) ∩ X_k}. Additionally, we can assume that f_k is
defined as in Proposition 2. By this result f_k(x_i) = ½(y_j + y_t),
where y_t is the largest value in {y_s : x_s ∈ L_⪯(x_i) ∩ X_k}. Since
|y_i − f_k(x_i)| ≤ ε and f_k(x_i) ≤ f_k(x_j), from y_i − y_j > 2ε one can
easily deduce that |y_j − f_k(x_j)| > ε, contrary to the feasibility of
ε. Therefore, if ε is feasible, then X_1, ..., X_p are cliques of the
graph Γ_ε.

Conversely, let X_1, ..., X_{p'} be a covering of the vertices of Γ_ε
with p' cliques (p' ≤ p). Let f_k be the isotonic function on X_k defined
in Proposition 2. Pick an arbitrary element x_i ∈ X_k. Then
f_k(x_i) = ½(y_j + y_t), where y_j is the smallest value in
{y_s : x_s ∈ L_⪰(x_i) ∩ X_k} and y_t is the largest value in
{y_s : x_s ∈ L_⪯(x_i) ∩ X_k}. Since x_t ⪯ x_i ⪯ x_j and x_t, x_j ∈ X_k,
we deduce that 0 ≤ y_t − y_j ≤ 2ε. Since y_i ∈ [y_j, y_t] we immediately
obtain that |y_i − f_k(x_i)| ≤ ε. Therefore D_∞(f_1, ..., f_p) ≤ ε, i.e.
ε is feasible. □

The problem of covering a graph with a given number of cliques is known
to be NP-complete [13]; however, in the particular case p = 2 it can be
easily solved. Indeed, a graph can be covered with two cliques if and
only if its complement is bipartite. To find a bipartition (alias
bicolouring) of the complement Γ̄_ε of Γ_ε one can simply use
breadth-first search; see [13] for details. We can construct Γ̄_ε
directly: two elements x_i, x_j ∈ X are adjacent in Γ̄_ε if and only if
x_i ≺ x_j and y_i − y_j > 2ε. Therefore, to solve the initial regression
problem, by Lemma 1 we must find the smallest feasible value in D. We use
binary search in the ordered matrix D. Namely, we start from a median ε
of this list. We construct the graph Γ_ε and check if this graph has a
covering with p cliques. If the answer is "yes" we continue the search in
the first half of the list (removing the second sublist from further
consideration). Otherwise, if the answer is "no", then we remove the
first half and continue the search in the sublist of D containing the
elements larger than ε. In the current list we take a median element as
the current ε and check if it is feasible. We continue the procedure
until we arrive at a list containing only one element ε*. This is the
optimal error for the formulated regression problem, while any covering
X_1, ..., X_p of Γ_{ε*} with at most p cliques and the isotonic functions
f_1, ..., f_p defined on X_1, ..., X_p according to Proposition 2
represent the optimal solution. To find it, we must perform O(log n)
feasibility tests (namely, log n² such tests) in the sorted matrix D (to
order the elements of D we need O(n² log n) operations). The graph Γ̄_ε
can be constructed in O(n²) time. If p = 2, within the same time bounds
one can decide if Γ̄_ε is bipartite. Therefore, the whole complexity of
the algorithm for p = 2 is O(n² log n).
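The procedure for p = 2 can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the names `two_colouring` and `optimal_bipartition`, the comparability predicate `leq(i, j)` (true when x_i ⪯ x_j) and the deduplication of candidate values before the binary search are all assumptions of the example. The complement graph is built from the conflicting pairs only, and its bipartiteness is tested by breadth-first search.

```python
from collections import deque
from typing import Callable, List, Optional, Tuple


def two_colouring(n: int, conflicts: List[Tuple[int, int]]) -> Optional[List[int]]:
    """Return a 0/1 colouring of the conflict graph, or None if not bipartite."""
    adj: List[List[int]] = [[] for _ in range(n)]
    for i, j in conflicts:
        adj[i].append(j)
        adj[j].append(i)
    colour = [-1] * n
    for start in range(n):
        if colour[start] != -1:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if colour[v] == -1:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return None  # odd cycle: two cliques do not suffice
    return colour


def optimal_bipartition(y: List[float], leq: Callable[[int, int], bool]
                        ) -> Tuple[float, List[int]]:
    """Smallest feasible error and a class label (0 or 1) for every point."""
    n = len(y)
    pairs = [(i, j) for i in range(n) for j in range(n)
             if i != j and leq(i, j) and y[i] >= y[j]]
    # Candidate errors: the entries of D (Lemma 1), including the value 0.
    candidates = sorted({0.0} | {(y[i] - y[j]) / 2 for i, j in pairs})

    def test(eps: float) -> Optional[List[int]]:
        # Pairs that cannot share a class: x_i below x_j but y_i - y_j > 2 eps.
        conflicts = [(i, j) for i, j in pairs if y[i] - y[j] > 2 * eps]
        return two_colouring(n, conflicts)

    lo, hi = 0, len(candidates) - 1  # the largest candidate is always feasible
    while lo < hi:
        mid = (lo + hi) // 2
        if test(candidates[mid]) is not None:
            hi = mid
        else:
            lo = mid + 1
    eps = candidates[lo]
    return eps, test(eps)
```

For the total order on four points with y = (3, 1, 4, 2), calling `optimal_bipartition([3, 1, 4, 2], lambda i, j: i <= j)` yields error 0 with the classes {x_0, x_2} and {x_1, x_3}, on each of which y is already increasing. Within each class the fitted function would then be obtained as in Proposition 2.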


FIGURE 2. 


Proposition 3 For p = 2 the optimal bipartition of X with respect to the
l_∞-criterion function can be constructed in O(n² log n) operations.


In Figure 2 we present an optimal bipartition of the poset from Figure
1. Note that the optimal error is ε* = 1.


3.3 Isotonic l_∞-regression problem with a given number of values


Some papers [4, 15, 14] consider the following approximation problem:
given an integer p and (x_1, y_1), ..., (x_n, y_n) in R² with
x_1 < ... < x_n, find a piecewise-linear function f with at most p links
such that max_{i=1,...,n} |f(x_i) − y_i| is minimized. Efficient
algorithms for solving this problem are presented in [15, 14]; for
motivation see [4, 15]. If instead of piecewise-linear functions we
consider stepwise functions with a fixed number of steps we obtain a
particular case of the rectilinear center trajectory problem investigated
in [6] (for the latter problem [6] presents an algorithm with the
complexity O(n²)). In this section we consider the following regression
problem: consider the numbers x_1 < ... < x_n, and assume that with each
x_i is associated a set of (distinct) numbers y_{i1}, ..., y_{i r_i}.
Given an integer p we wish to find an isotonic stepwise function f that
minimizes

\[
D_\infty(f) = \max_{i = 1, \ldots, n} \ \max_{l = 1, \ldots, r_i} |y_{il} - f(x_i)|
\]

subject to the restriction that f takes at most p distinct values. Let a; = 
minj=1,....r, Ya and b; = max)... r; ya. The key observation is that, as in the 
previous section, the optimal error e* of Dæ is an element of the matrix 
D = (dij), where di; = |b; — a;|. Therefore we can use a binary search 
in the ordered list of the elements of D. With a current € € D we must 
answer the following question: ”There is an isotonic function f with at 
most p steps such that D.(f) < €?” To perform this test we proceed 
as follows. For a given x; denote by S; the intersection of the segments 
la; — €,a; + €| and [b; — €, bi + €]. We sweep the list z1,...,2, from left to 
right. We need three parameters S, q and S whose meaning shall became 
clear immediately. Initially, let S := S1, q := 0 and S = (—oco, z1]. At point 
x; we do the following. Find S N S;. If this intersection is nonempty, then 
set S:= S N Si, S = S U (a;_-1, zi] and go to the point 2,41. Otherwise, 
if SMS; = 0, then for all x € S define f(x) := s, where s is an arbitrary 
value from the segment S. If z; < s, then stop: the test has a negative 
answer. Otherwise, set S := S;,qg:=q+1, S = (a2;-1, £i] and consider the 
next point z;+ı (of course, if i = n we simply put f(z;) = yi and finish 
the procedure). After n steps we return answer ”yes” if q < p and the 
answer ”no”, otherwise. The complexity of this procedure is O(n). The 
proof of correctness is straighforward. To find an optimal isotonic function 
f with at most p values we must perform O(logn) feasibility tests in the 
ordered matrix D. If we simply sort the matrix D, the total complexity 
of the algorithm will be O(n7logn) (actually, this is the time to sort D, 
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because the complexity of testing is only O(nlogn). We can improve the 
whole complexity of our algorithm. Instead of constructing and sorting 
the matrix D, we can use the selection algorithm of [23]. It presents an 
O(nlog?n) time algorithm for computing the kth largest element in the 
set of all simple paths in a tree. One can view the sorted list of numbers 
{a;,b;:i=1,...,n} as a path; therefore, we can apply the algorithm from 
[23] O(logn) times, leading us to an algorithm with the total complexity 
O(nlog?n). 
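
The feasibility test described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `feasible` and `groups` are hypothetical, the segment $S_i$ is computed as $[b_i - \epsilon, a_i + \epsilon]$, and the stopping rule is read as "the new segment $S_i$ lies entirely below the current segment $S$", which is what isotonicity requires.

```python
from typing import Sequence


def feasible(groups: Sequence[Sequence[float]], eps: float, p: int) -> bool:
    """Sweep test: is there an isotonic step function with at most p values
    whose largest absolute error over all observations is at most eps?

    groups[i] holds the y-values attached to the i-th point (points sorted by x).
    """
    q = 0            # number of times a new step had to be opened
    lo = hi = None   # current segment S = [lo, hi]
    for ys in groups:
        a, b = min(ys), max(ys)
        # S_i = [a - eps, a + eps] intersected with [b - eps, b + eps]
        s_lo, s_hi = b - eps, a + eps
        if s_lo > s_hi:          # S_i is empty: eps is too small for this point
            return False
        if lo is None:           # first point: S := S_1
            lo, hi = s_lo, s_hi
            continue
        new_lo, new_hi = max(lo, s_lo), min(hi, s_hi)
        if new_lo <= new_hi:     # S and S_i intersect: extend the current step
            lo, hi = new_lo, new_hi
        elif s_hi < lo:          # S_i lies entirely below S: isotonicity fails
            return False
        else:                    # S_i lies entirely above S: open a new step
            lo, hi = s_lo, s_hi
            q += 1
    return q < p


data = [[1.0, 2.0], [1.5], [4.0, 5.0], [4.5]]
print(feasible(data, 0.5, 2), feasible(data, 0.5, 1))
```

With `eps = 0.5` the four groups above require two steps, so the test succeeds for `p = 2` and fails for `p = 1`. Each call makes a single pass over the points, in line with the $O(n)$ bound stated in the text.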


Proposition 4 Given a total order $x_1 < \ldots < x_n$ and an integer $p > 0$,
an isotonic function $f$ minimizing the $l_\infty$-criterion function subject to
the restriction that $f$ has at most $p$ distinct values can be constructed in
O(nlogn?) number of operations. 


Most likely, using the parametric search as in [2, 14] one can solve this 
problem more efficiently. We leave open the question whether a similar 
problem for all partial orders is NP-complete. Finally note that within 
the same time bounds we can solve the problem of approximating with a 
stepwise function with at most p distinct values. Again, the optimal
$l_\infty$-error is an element of the matrix D. We can apply the same test, but in
case $S \cap S_i = \emptyset$ it is not necessary to check whether $x_i < s$.

Abstract: In this paper we outline the early history of traditional estimation
procedures which are based on the use of elemental sets. There
are two distinct classes of such procedures associated with the minimum 
values of the sum of absolute errors and the largest absolute error criteria 
respectively. As a matter of historical necessity, our study will concen- 
trate on estimation procedures of the first type. However we shall also 
discuss some recent work on the least median of squared errors procedure 
which in principle involves elemental sets of the second type. 

1 Introduction


In his study of the practical value of elemental set approximations to robust 
estimation procedures, Hawkins (1993, p.580) has summarised the history 
of such methods in the following terms: 

Elemental set methods have their origins in the single-predictor proposal 
by Theil (1950). The extension of the idea to handling outlier problems 
in multiple regression was made independently by Rousseeuw (1984) and 
by Hawkins, Bradu and Kass (1984). They also arise naturally in the 
expression of the OLS multiple regression in terms of weighted U-statistics 
(as sketched in the technical appendix to Hawkins et al., 1984). 

Although this brief statement may well have been sufficient for the pur- 
poses of Hawkins’s paper, it cannot be regarded as an adequate summary 
of the history of elemental set methods as it fails to mention that methods
of this type had been used to estimate the parameters of linear and
nonlinear models with two or more predictors for more than 240 years, or 
that the elemental set characterisation of the ordinary least squares (OLS) 
estimator has been known for more than 150 years. The purpose of the 
present paper is to provide a more comprehensive survey of the history of 
this subject. We shall also take the opportunity to mention some recent 
work on the least median of squared errors procedure. 


2 Nature of the fitting problem 


Some readers of this paper may be prompted to further their historical 
studies by consulting some of the original sources cited in it. Such readers 
will soon discover that the traditional notation and nomenclature of the 
Calculus of Observations is quite distinct from that of modern Mathemati- 
cal Statistics. To help the interested reader through this difficulty we shall 
employ a variant of this traditional usage in our history. 

In traditional notation the familiar curve or surface fitting problem may 
be expressed in the following terms: We are given a linear or nonlinear func- 
tion f(.) which is characterised by a set of p observed (variable) quantities 
a,b,c,etc. and a set of q unobserved fixed quantities (we would call them 
parameters) x,y, z,etc. We suppose that the variable quantities a, b, c, etc. 
are observed without error but that the value of the function f(.) is subject 
to an additive error. Denoting the observed value of the function by m and 
the corresponding value of the additive error by v, we find that we have a 
system of n equations: 


$$m_i = f(a_i, b_i, c_i, \text{etc.}; x, y, z, \text{etc.}) + v_i, \qquad i = 1, 2, \ldots, n$$


which describes the relationship between the observed values of the variable 
quantities and the corresponding observed values of the function. In this 
context we have to choose values for the q unknown quantities x,y, z, etc. in 
such a way that the observed errors (we would say residuals) $v_1, v_2, \ldots, v_n$
are as small as possible in some sense. 

Although this general statement of the problem admits the possibility 
of nonlinear functions, we shall be largely concerned with functions f(.) 
which are linear in the unknown constants, the major exceptions to this
rule being found in the final paragraph of Section 3.

In passing we note that this traditional notation does not distinguish 
between the true and fitted values of the errors or between the true and 
fitted values of the unknown constants. This imprecision is of no imme- 
diate consequence as we shall only be concerned with fitted values in the 
remainder of this paper. 


Notes on the early history of elemental set methods 163 


3 Elemental set methods 


The most obvious solution to the general fitting problem outlined in Section 
2 is to discard n — q of the available equations and to use the remaining q 
equations to determine a set of values for the q unknown constants in such 
a way that these q equations are exactly satisfied with zero errors. This 
fitting procedure is variously known as the method of selected points, the 
subset selection method, and the method of elemental sets. 

Now there are ${}^{n}C_{q}$ distinct ways of choosing q equations from a set of
n equations, so that practitioners have either to choose a single set of q 
equations at random or according to some specified rule, or they have to 
select a greater number of sets and attempt to reconcile the discordant 
results obtained from their selection. 
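
To make the counting argument concrete, the following Python sketch enumerates every elemental set determination for a straight-line model $m_i = x + y a_i + v_i$ with q = 2 unknowns. It is only an illustration: the function name `elemental_fits` and the data are assumptions, and pairs with equal a-values are skipped because they do not determine a unique line.

```python
from itertools import combinations
from math import comb


def elemental_fits(a, m):
    """Yield (i, j, intercept, slope) for every pair of equations that
    determines a unique straight line m = intercept + slope * a."""
    for i, j in combinations(range(len(a)), 2):
        if a[i] == a[j]:
            continue  # singular pair: the two equations do not fix the line
        slope = (m[j] - m[i]) / (a[j] - a[i])
        intercept = m[i] - slope * a[i]
        yield i, j, intercept, slope


a = [0.0, 1.0, 2.0, 3.0, 4.0]
m = [1.0, 3.0, 5.0, 7.0, 20.0]   # the last observation is discordant
fits = list(elemental_fits(a, m))
print(len(fits), comb(len(a), 2))
for i, j, intercept, slope in fits:
    print(i, j, intercept, slope)
```

With n = 5 and q = 2 there are 10 determinations, each of which satisfies its own two equations exactly. The determinations that involve the last observation disagree with the others, which is the reconciliation problem described in the text.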

In the earliest period of enquiry in this area, scientists employed a vari- 
ant of the first procedure in which the selection criterion is obscure, if not 
entirely hidden. Without making any attempt to explain their reasons, 
these authors arranged that there should be as many equations as were 
required for the problem to have a unique solution. For example, in his 
analysis of the problem of determining the height of a tree on a remote 
hillside, Liu Hui (third century) assumed that the surveyor had taken ob- 
servations on the elevation of the top of the tree from each of two locations 
in a horizontal reference plane and a third observation on the elevation of 
the base of the tree from one of these locations. These three observations 
together with the known horizontal distance between the two observation 
sites are sufficient to determine the three unknowns of the problem, see Li 
and Du (1987, pp.76—78) for details. However it is not explained why the 
surveyor should not have observed both elevations from both locations, or 
what was to be done if he had. It is intriguing to speculate on how Liu Hui 
would have responded if he had been challenged on this point. 

By the middle of the eighteenth century, this implicit choice of a single 
set of q equations had been replaced by a more explicit procedure. Mayer 
(1750, p.150), for example, suggested that one should choose a set of q 
equations that are typical of the n given equations. In the particular case 
of his determination of the position of the lunar crater Manilius, he obtained 
a system of n = 27 equations in q = 3 unknowns of the form


$$m_i = x + y \sin(\theta_i) + z \cos(\theta_i) + v_i, \qquad i = 1, 2, \ldots, n$$


and suggested that the q = 3 typical equations should be chosen in such 
a way that two of the angles differ from the third by 90 and 180 degrees 
respectively. However, having advanced this basic suggestion, he observed 
that one needs more than a single set of typical equations if one wishes to 
check the accuracy of the original solution. In the limiting case one thus
obtains a total of [n/q] distinct solutions from disjoint subsets of the data 
which then have to be reconciled with one another. 

Unfortunately Mayer did not discuss this technique in sufficient detail 
for us to be sure how many disjoint subsets of q equations he would have 
used, or how he would have attempted to reconcile the corresponding dis- 
cordant results. Nevertheless his lack of precision in this regard is entirely 
comprehensible as his exposition of this topic was as a brief aside to his 
main purpose of introducing a new fitting procedure known as the method 
of averages. 

A few years later, Boscovich addressed a similar problem relating to 
the ellipsoidal figure of the Earth in a similar way. In his contribution 
to Maire and Boscovich (1755) and again in Boscovich (1757), Boscovich 
was concerned with the solution of a system of n = 5 linear equations in 
q = 2 unknowns. In his first approach to this problem he evaluated all 
${}^{5}C_{2} = 10$ pairwise determinations of the unknown constants before taking
an unweighted arithmetic mean. He was not satisfied with the result and 
tentatively suggested a variant which discards the two determinations with 
the smallest denominators before again taking an unweighted average of 
the remaining eight values. 

Boscovich was far from satisfied with the results he obtained from ei- 
ther variant of this procedure and, like Mayer before him, resolved the 
impasse by proposing an alternative fitting procedure. Boscovich’s alter- 
native procedure is to be found in his scientific notes to a poem in Latin 
hexameters by Stay (1760, pp.420—425), see Farebrother (1993). It chooses 
values for the unknown constants in such a way as to minimise the sum of 
the absolute values of the observed errors subject to the condition that the 
corresponding sum of the signed errors is zero. The relationship between 
this procedure and the method of elemental sets will be outlined in the 
following section. 

Finally, in this connection, we must observe that the method of elemen- 
tal sets was not entirely supplanted by more advanced methods (notably 
the method of least squares) until well into the present century. This state- 
ment is particularly true of nonlinear problems as Pearson (1902, p.298) 
mentions the possible use of the method of elemental sets to fit Makeham’s 
law (a generalisation of the Gompertz function) to actuarial data, and Yule 
(1925, pp.49-50) notes that this method produces acceptable results when 
fitting a logistic function to sufficiently smooth demographic data. Indeed, 
the more elementary statistical textbooks of the 1950s and 1960s still rec- 
ommended the method of elemental sets for the latter purpose, see Croxton 
and Cowden (1939; 1955, p.215). 


4 Elemental set characterisations


Boscovich provided a geometrical algorithm as an integral part of his solu- 
tion procedure. This algorithm was subsequently given an analytical form 
by Laplace (1793), see Farebrother (1993), Sheynin (1973), or Stigler (1986) 
for details. Some years later Gauss (1809, sec.186) generalised Boscovich’s 
optimality criterion to any number of unknowns and suggested that the 
adding-up constraint could be deleted. (In passing we note that this adding- 
up constraint has no direct connection with the method of least squares 
which was developed by Gauss and Legendre some years after Boscovich’s 
death.) 

In his discussion of the unconstrained least sum of absolute errors prob- 
lem, Gauss (1809, sec.186) notes that the optimal solution to this problem 
is characterised by a set of q zero errors and that the other n — q errors 
only help to determine this optimal set. He gives no justification for this 
result but a proof which would have been accessible to Gauss and his con- 
temporaries has been suggested by Waterhouse (1990). 

An explicit characterisation of the solution to the least squares problem 
as a weighted sum of elemental set determinations was established by Ja- 
cobi (1841). See Sheynin (1973) for an excellent description of this result 
which closely follows Jacobi’s own derivation. This result was subsequently 
rediscovered by Glaisher (1879), Subrahmanyam (1972), Hawkins, Bradu 
and Kass (1984), and Ben-Tal and Teboulle (1990) amongst others. The 
long interlude between the publication of the papers by Glaisher and Sub- 
rahmanyam would seem to be due to the intervention of a clear statement of 
Jacobi’s result in the popular textbook by Whittaker and Robinson (1924; 
1944, pp.251-252). 
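
For the straight-line model $m_i = x + y a_i + v_i$ the characterisation can be checked numerically. In this case each elemental set is a pair of equations, and the weight attached to a pair is the squared determinant of its 2 x 2 coefficient matrix, which reduces to $(a_j - a_i)^2$. The Python sketch below compares the weighted average with the usual least squares formulae; the function names and the data are illustrative assumptions.

```python
from itertools import combinations


def ols_line(a, m):
    """Least squares intercept and slope from the usual closed-form formulae."""
    n = len(a)
    abar, mbar = sum(a) / n, sum(m) / n
    sxx = sum((ai - abar) ** 2 for ai in a)
    sxy = sum((ai - abar) * (mi - mbar) for ai, mi in zip(a, m))
    slope = sxy / sxx
    return mbar - slope * abar, slope


def weighted_elemental_line(a, m):
    """Weighted average of all pairwise determinations, with weights equal
    to the squared determinants (a_j - a_i)**2 of the elemental systems."""
    wsum = isum = ssum = 0.0
    for i, j in combinations(range(len(a)), 2):
        d = a[j] - a[i]           # determinant of [[1, a_i], [1, a_j]]
        if d == 0:
            continue              # singular pair carries zero weight
        w = d * d
        slope = (m[j] - m[i]) / d
        intercept = m[i] - slope * a[i]
        wsum += w
        isum += w * intercept
        ssum += w * slope
    return isum / wsum, ssum / wsum


a = [0.0, 1.0, 2.0, 3.0, 5.0]
m = [1.0, 2.5, 2.0, 4.5, 9.0]
print(ols_line(a, m))
print(weighted_elemental_line(a, m))
```

The two printed pairs agree up to rounding error. The same construction with q x q determinants applies to models with more unknowns, at the cost of enumerating all subsets of q equations.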

This explicit characterisation of the solution to the least squares problem 
may be generalised to a weighted sum of least squares determinations from 
sets of m > q equations, see Sheynin (1993) and Wu (1986) for details. 
As a further generalisation of this result, Ben-Tal and Teboulle (1990) have
shown that, for all members of a class of strictly isotone functions which 
includes the weighted sum of the kth powers of the absolute errors (for 
some fixed finite positive value of k), every set of values for the unknown 
constants which minimises the chosen function of the errors will lie within 
the convex hull of the elemental set determinations. In addition they have 
shown that, for all members of a class of isotone (but not strictly isotone) 
functions including the largest absolute error and the median absolute error 
functions, at least one of the sets of values for the unknown constants which 
minimise the chosen function of the errors must lie within the convex hull 
of the elemental set determinations of these values. 
Thus, although the solutions to the least sum of absolute errors (k = 1)
and the least sum of squared errors (k = 2) problems must lie in the con- 
vex hull of the elemental set determinations, the solutions of the minimax 
absolute errors and least median of (squared) absolute errors problems will 
necessarily be members of the convex hull only if these solutions are unique. 

Further, although all the solutions of the weighted sum of kth powers 
(or other strictly isotone) problem must lie in a convex set for all sets of 
weights, the set of all such solutions need not itself be convex. Gilstein and 
Leamer (1983) have given a precise characterisation of the nonconvex set 
of solutions to the weighted least squares problem. 


5 Minimax absolute error criterion 


The class of (Gaussian) elemental set methods discussed in Sections 3 and 4 
may be associated with the optimal value of the sum of absolute errors cri- 
terion. A second class of (Laplacian) elemental set methods may similarly 
be associated with the optimal value of the maximum absolute error crite- 
rion. As its name implies, the minimax absolute error procedure chooses 
values for the unknown constants in such a way as to minimise the largest 
in absolute value of the n observed errors. This fitting procedure was first 
discussed by Laplace (1786). Given a set of n linear equations in q = 2 
unknowns, Laplace arbitrarily selected a set of r = q + 1 equations. Using 
any q of these equations to eliminate the q unknowns from the remaining 
equation, he obtained a single (reduced) equation with a linear function of 
the r observed errors on one side and a nonnegative constant on the other. 
Without further explanation, he asserted that the largest in absolute value 
of these r errors is minimised when all r errors take the same absolute value 
and their signs are given by the signs of the corresponding coefficients in 
the single reduced equation. 

A determinantal formulation of this procedure was subsequently devel- 
oped by de la Vallée Poussin (1911). In this alternative formulation of the 
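problem one has to set the r selected errors proportional to the signs of the
cofactors of $m_1, m_2, \ldots, m_r$ in the $r \times r$ determinant

$$
\begin{vmatrix}
a_1 & b_1 & c_1 & \cdots & m_1 \\
a_2 & b_2 & c_2 & \cdots & m_2 \\
\vdots & \vdots & \vdots & & \vdots \\
a_r & b_r & c_r & \cdots & m_r
\end{vmatrix}
$$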

where, for notational simplicity, we assume that the trial solution is defined 
by the first r equations of the model. The values of the unknown constants 
and the common absolute value of the r errors are then obtained by applying 
Cramer’s Rule to the corresponding system of r equations in r unknowns.

Under either of these schemes we have to evaluate all n errors $v_1, v_2, \ldots, v_n$
and select a new set of r equations if one or more of these errors is larger 
in absolute value than the common value of the r errors in the current 
set. Laplace (1786) apparently proposed to choose this new set without 
reference to earlier selections. By contrast, de la Vallée Poussin (1911) 
introduced an automated selection procedure which Stiefel (1960) subse- 
quently identified with his own (1959) procedure and both as equivalent to 
a standard simplex implementation of the linear programming dual formu- 
lation of the minimax problem. 

In this context it is interesting to note that Farebrother (1985) has shown 
that the minimax absolute error procedure is closely related to the linear 
programming dual formulation of the least sum of absolute errors problem. 
Also see Farebrother (1997) for a detailed account of the early history of 
the minimax absolute error procedure with particular reference to the work 
of de Prony (1804) and Fourier (1827). 


6 Median squared absolute error criterion 


These results on the minimax absolute error procedure have recently come 
to prominence as Rousseeuw’s (1984) so-called least median of squares pro- 
cedure actually chooses values for the unknown constants in such a way as 
to minimise the median or middlemost value of any increasing function of 
the absolute values of the observed errors. For example, suppose that we 
wish to minimise the hth largest squared error where h is set close to n/2.
Then, conditional on the choice of the h - 1 equations which are to be
ignored, the least median of squared errors problem may be expressed in the
form of a minimax absolute error problem applied to the n - h + 1 retained
equations. The optimal solution to the least median of squared errors prob- 
lem is thus also characterised by a (Laplacian) elemental set determination 
of the unknown constants. In principle, we may therefore determine the 
optimal values of these q constants by evaluating the median squared er- 
ror function for a sufficiently large sample of the minimax absolute error 
determinations of the q unknowns from sets of q + 1 equations. 

However, it is clear that the minimax fitting of a system of q + 1 equa- 
tions in q unknowns is vastly more expensive than the direct solution of 
a set of q equations in q unknowns. Rousseeuw and Leroy (1987) have
therefore suggested that a sufficiently accurate approximation to the ex- 
act least median of squares solution may be obtained by evaluating the 
median squared error function for a sufficiently large sample of Gaussian 
elemental set determinations. This conjecture was subsequently confirmed 
by Stromberg (1993) whilst Hawkins (1993) has shown that this technique
yields satisfactory results for a wide class of robust fitting procedures. 
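
The approximation described in the last paragraph can be sketched for a straight-line model as follows. This Python code is a minimal illustration, not the published procedure: the function names are hypothetical, the "median" is taken as the squared residual of rank n//2 + 1 (one common convention), and the candidate lines are the Gaussian elemental set determinations from pairs of observations, either all of them or a random sample.

```python
import random
from itertools import combinations


def median_sq_error(a, m, intercept, slope):
    """Squared residual of rank n//2 + 1 (0-based index n//2 after sorting)."""
    r2 = sorted((mi - intercept - slope * ai) ** 2 for ai, mi in zip(a, m))
    return r2[len(r2) // 2]


def lms_by_elemental_sets(a, m, n_trials=None, seed=0):
    """Approximate least median of squares line: evaluate the criterion on
    lines through pairs of points and keep the best one found."""
    pairs = [(i, j) for i, j in combinations(range(len(a)), 2) if a[i] != a[j]]
    if n_trials is not None and n_trials < len(pairs):
        pairs = random.Random(seed).sample(pairs, n_trials)
    best = None
    for i, j in pairs:
        slope = (m[j] - m[i]) / (a[j] - a[i])
        intercept = m[i] - slope * a[i]
        crit = median_sq_error(a, m, intercept, slope)
        if best is None or crit < best[0]:
            best = (crit, intercept, slope)
    return best  # (criterion value, intercept, slope)


a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
m = [1.0, 3.0, 5.0, 7.0, 9.0, 40.0, -30.0]   # two gross outliers
print(lms_by_elemental_sets(a, m))
```

In this example five of the seven observations lie exactly on the line m = 1 + 2a, so the exhaustive search returns that line with a criterion value of zero despite the two outliers. Passing `n_trials` restricts the search to a random sample of pairs, which is the situation the text has in mind for larger problems.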

Abstract: Similar to standard quantile regressions, the censored quantile
regression estimate interpolates some data points. This paper discusses 
the algorithms used in empirical research in light of this interpolation 
property and compares their performance in a simulation study. The 
results show that the ranking between algorithms differs depending on 
the criterion used. The algorithm BRCENS, suggested by the author in
the past, performs best in terms of the frequency with which the exact censored
quantile regression estimates are obtained; it is very competitive in terms
of the computation times required, and its performance can be noticeably
improved by trying out various starting values. However, BRCENS is
not optimal in terms of the root-mean-squared deviation of the coefficient 
estimates, indicating a high skewness of the distribution of the deviation 
from the exact estimates. Overall, BRCENS can be recommended for 
moderate degrees of censoring, whereas all practical algorithms perform 
quite poorly when a lot of censoring is present. 


Key words: Censored quantile regression, algorithms, BRCENS, ILPA, 
NLRQ. 

1 Introduction


Censored quantile regressions (CQR’s), introduced by Powell (1984, 1986), 
are an attractive approach to the estimation of the censored regression 
model with fixed known censoring points. First, compared to Tobit maxi- 
mum likelihood estimation, cf. Amemiya (1985, chapter 10), CQR’s provide 
consistent estimates under far weaker distributional assumptions. Second,
CQR’s allow us to model the conditional quantiles of the dependent variable
as a function of the regressors, cf. Buchinsky (1994), Chamberlain (1994),
or Fitzenberger et al. (1995) modelling the conditional wage distribution.

¹ Buchinsky (1997, section 8) and Fitzenberger (1997) provide general guides to CQR’s.

The computation of the CQR estimates involves minimizing a non- 
differentiable and non-convex distance function. When introducing the 
censored least absolute deviation (LAD) regression (i.e. the CQR for the 
special case of the median), Powell (1984) suggested generic optimization 
routines, which do not take account of the special characteristics of the 
problem.² Paarsch (1984) even resorted to grid search when evaluating the
finite sample properties of censored LAD regression. Womersley (1986) an- 
alyzed the numerical properties of censored LAD regression for the first time 
and suggested an algorithm using a finite direct descent method. However, 
to my knowledge, his code is not available to applied researchers. 

More recently, Buchinsky (1994) suggested an Iterative Linear Programming
Algorithm (ILPA) involving an iteration of the Barrodale-Roberts
Algorithm (BRA) first developed for standard LAD regression. However,
ILPA is not guaranteed to converge and convergence does not guarantee a 
local minimum of the CQR optimization problem. Building on the char- 
acterization of the CQR estimates by the interpolation property presented 
in the following, Fitzenberger (1994) developed the algorithm BRCENS as 
an adaptation of the BRA guaranteeing convergence to a local minimum. 
Koenker and Park (1996) developed a general interior point algorithm for 
nonlinear quantile regression problems (NLRQ), which they apply to the 
CQR case. The simulation studies in Fitzenberger (1994, 1997) show that 
BRCENS performs best in comparison to ILPA and NLRQ in terms of the 
frequency that the exact global minimum of the CQR optimization prob- 
lem is obtained. However, the simulation studies show that all algorithms 
perform quite poorly in the presence of a lot of censoring. 

The purpose of this paper is to give an overview of the computational
aspects of CQR’s and to provide more extensive simulation evidence on the 
performance of various algorithms currently in use. In the remainder of 
this section, I present the interpolation property characterizing the CQR 
estimates. Section 2 describes various algorithms in detail and presents 
modified versions of the algorithms BRCENS and NLRQ. Section 3 ex- 
tends the available evidence from simulation studies. Summarizing the 
simulation results, BRCENS performs quite well in comparison and it can 
be recommended for moderate degrees of censoring relative to the quantile
considered. However, in the presence of a lot of censoring all algorithms
perform quite poorly.

² Following Powell’s suggestion, Horowitz and Neumann (1987) present an empirical
application based on the Nelder and Mead (1965) algorithm.

³ Koenker and d’Orey (1987) provide an extension of the BRA to the general quantile
regression case. For the latter, the following discussion refers to this extension.


1.1 Censored quantile regression and interpolation 
property 


Introducing some notation, for a sample of size N, let the dependent variable
be the N x 1 vector $y = (y_1, \ldots, y_N)'$, the design matrix be the N x k
matrix $X = (x_1, \ldots, x_N)'$, with $x_i' = (x_{i,1}, \ldots, x_{i,k})$, the N x 1 vector
of fixed known observation specific censoring values be
$yc = (yc_1, \ldots, yc_N)'$, the N x 1 vector of disturbances be
$\epsilon = (\epsilon_1, \ldots, \epsilon_N)'$ and the k x 1 parameter vector be $\beta$.

The following discussion considers a censored regression model with censoring
from above. For a given quantile $\theta \in (0,1)$, the CQR estimation
problem⁴ is to minimize the piecewise linear distance function given by

$$\hat\beta_\theta \in \arg\min_{\beta} \sum_{i=1}^{N} \mathrm{sgn}_\theta(y_i - \min[x_i'\beta, yc_i]) \cdot (y_i - \min[x_i'\beta, yc_i]), \qquad (1)$$

where the $\theta$ weighted sign function is given by

$$\mathrm{sgn}_\theta(\epsilon_i) = \theta I(\epsilon_i > 0) - (1 - \theta) I(\epsilon_i < 0)$$

and $I(.)$ denotes the indicator function. The expression $x_i'\beta_\theta$ captures the
$\theta$-quantile of the underlying uncensored dependent variable conditional on
$x_i$.
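
As an illustration, the distance function (1) can be evaluated with a few lines of Python. The sketch below is an assumption-laden example rather than code from the paper: the names `sgn_theta` and `cqr_objective` are hypothetical, and the data are passed as plain lists with one regressor row per observation.

```python
from typing import Sequence


def sgn_theta(e: float, theta: float) -> float:
    """Theta weighted sign function: theta for e > 0, -(1 - theta) for e < 0."""
    if e > 0:
        return theta
    if e < 0:
        return -(1.0 - theta)
    return 0.0


def cqr_objective(beta: Sequence[float], X: Sequence[Sequence[float]],
                  y: Sequence[float], yc: Sequence[float], theta: float) -> float:
    """CQR distance function (1) for censoring from above."""
    total = 0.0
    for xi, yi, yci in zip(X, y, yc):
        # fitted value is the regression line capped at the censoring point
        fitted = min(sum(b * x for b, x in zip(beta, xi)), yci)
        e = yi - fitted
        total += sgn_theta(e, theta) * e
    return total


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # intercept and one regressor
y = [1.0, 2.0, 2.5]                         # third observation is censored
yc = [2.5, 2.5, 2.5]
print(cqr_objective([1.0, 1.0], X, y, yc, 0.5))   # 0.0
print(cqr_objective([0.0, 1.0], X, y, yc, 0.5))   # 1.25
```

The first coefficient vector reproduces all three observations once the fitted value of the third is capped at its censoring point, so the distance is zero. The second leaves residuals of 1, 1 and 0.5, each weighted by $\theta = 0.5$.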

Since the CQR distance function (1) is piecewise linear, the CQR minimization
problem does not necessarily have a unique solution.⁵ Analogous
to standard quantile regressions, cf. Koenker and Bassett (1978), the fol- 
lowing interpolation property can also be established for CQR’s. 


Interpolation Property: If the design matrix X has full rank k, then
there exists a global minimizer $\hat\beta_\theta$ of the CQR distance function such
that $\hat\beta_\theta$ interpolates at least k data points, i.e. there are k observations

$\{(y_{i_1}, x_{i_1}), \ldots, (y_{i_k}, x_{i_k})\}$ with

(IP) $y_{i_l} = x_{i_l}'\hat\beta_\theta$ for $l = 1, \ldots, k$ and the rank of $(x_{i_1}, \ldots, x_{i_k})'$ equals k.


⁴ Fitzenberger (1997) treats censoring both from above and below and provides the
asymptotic distribution of the CQR estimator. The expression
"$\mathrm{sgn}_\theta(\epsilon_i) \cdot \epsilon_i$" is mostly referred to as the "check function"
$\rho_\theta(\epsilon_i)$. The notation used here has advantages when
studying the asymptotic distribution of the estimator.

⁵ Womersley (1986, p. 112) and Fitzenberger (1994) provide a more complete
characterization of the set of minimizers.

When evaluating the IP, the following three points deserve attention. First,
if the CQR distance function exhibits a unique minimizer, it must satisfy IP.
Second, the CQR can interpolate a censoring point where an observation is
censored. And third, in contrast to standard quantile regressions, cf. Koenker
and Bassett (1978), it is not guaranteed that a share of at most $(1 - \theta)$
[$\theta$] of the observations lies above [below] the estimated CQR line with an
intercept.

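The IP is established by analyzing the kinks of the piecewise linear distance
function (1) for which the directional derivative proves an important
tool. The directional derivative evaluated at $\beta$ in direction $w \in R^k$ is given
by

$$
\begin{aligned}
H_\theta(\beta, w) = \sum_{i=1}^{N} \Big[ & I(x_i'\beta < yc_i) \{ -\mathrm{sgn}_\theta(y_i - x_i'\beta) - I(x_i'\beta = y_i)\, \mathrm{sgn}_\theta(-x_i'w) \} \\
& + I(x_i'\beta = yc_i) \{ (1 - \theta) I(y_i < yc_i,\, x_i'w < 0) - \theta I(y_i = yc_i,\, x_i'w < 0) \} \Big] \cdot x_i'w . \qquad (2)
\end{aligned}
$$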


2 Algorithms 


Whereas previous simulation studies relied on grid search to determine the 
CQR estimates, the interpolation property (IP) discussed above suggests 
an enumeration algorithm to determine the CQR estimate exactly, i.e. an 
element out of the set of global minimizers. This algorithm, which I denote 
by IPOL, consists of an enumeration of the set of all k-tuples of data points
with linearly independent regressor vectors and the corresponding interpolating
regression line. Then the ones minimizing the CQR distance function
are in the set of global minimizers. IPOL involves the evaluation of at
most $\binom{N}{k}$ k-tuples. In contrast to grid search, IPOL guarantees to find a
global minimum exactly and is typically much faster than grid search. The 
computational advantage of IPOL relative to grid search increases with the 
required accuracy of the estimates and the number of the regressors k and it 
decreases with the number of observations N. The algorithms discussed in 
the following, which are typically much faster than IPOL, will be contrasted 
in Section 3 with the exact CQR estimates obtained by IPOL. 
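
The enumeration idea behind IPOL can be sketched in Python for the simplest case of an intercept and one regressor (k = 2). This is a hedged illustration rather than the author's implementation: the names `cqr_objective` and `ipol` are hypothetical, and the code simply interpolates every pair of data points with distinct regressor values and keeps the line with the smallest value of the distance function (1).

```python
from itertools import combinations


def cqr_objective(b0, b1, x, y, yc, theta):
    """CQR distance function (1) for a line b0 + b1 * x, censoring from above."""
    total = 0.0
    for xi, yi, yci in zip(x, y, yc):
        e = yi - min(b0 + b1 * xi, yci)
        total += theta * e if e > 0 else -(1.0 - theta) * e
    return total


def ipol(x, y, yc, theta):
    """Enumerate all lines through two data points with distinct x-values and
    return (objective, intercept, slope) for the best one."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # regressor vectors (1, x_i) and (1, x_j) are dependent
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        val = cqr_objective(b0, b1, x, y, yc, theta)
        if best is None or val < best[0]:
            best = (val, b0, b1)
    return best


x = [0.0, 1.0, 2.0, 3.0, 4.0]
yc = [3.5] * 5
y = [1.0, 2.0, 3.0, 3.5, 3.5]   # latent line 1 + x, censored from above at 3.5
print(ipol(x, y, yc, 0.5))
```

In this noise-free example the enumeration recovers the latent line with intercept 1 and slope 1 and an objective value of zero, although two of the five observations are censored. The number of candidate lines grows as $\binom{N}{k}$, which is why the text treats IPOL as a benchmark rather than a practical algorithm.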


2.1 BRCENS 


The algorithm BRCENS is developed in Fitzenberger (1994) as an adaptation
of the standard Barrodale-Roberts Algorithm (BRA) for standard
quantile regressions to the censored quantile regression case. A standard
quantile regression exhibits a linear programming structure. Barrodale
and Roberts (1973) notice that the IP allows for a more efficient,
condensed simplex approach. Only kinks of the distance function need to
be considered for which k observations are interpolated (assuming the design
matrix has full rank) and for which the rank of the matrix formed by the
regressor vectors is equal to k.

Analogous to BRA, the algorithm BRCENS involves two parts. First, 
the algorithm starts with all coefficients being zero as the set of nonbasic 
variables (NB). The algorithm proceeds in k steps, where in each step 
that coefficient from NB changes into a direction for which the directional 
derivative (2) indicates the strongest decline of the objective. This defines 
a one-dimensional search direction and the coefficient is changed along this
direction until the objective starts increasing again. At this point, accord- 
ing to the expression for the directional derivative in (2), there is at least 
one data point being interpolated. One of these interpolated data points 
now replaces the coefficient leaving NB. At the end of the first part, the 
algorithm has reached a situation where IP is satisfied. The second part 
of BRCENS considers exchanging one of the interpolated observations in 
NB with a different observation. With all other data points in the NB re- 
maining interpolated, considering one data point defines a one-dimensional 
search direction. The algorithm keeps moving into the search direction, for 
which the directional derivative indicates the largest decline of the objec- 
tive, until the objective cannot be reduced further. At this new point there 
exists at least one data point which is added to NB, the set of interpolated 
data points. The algorithm stops when the directional derivatives for all 
interpolated data points in NB are non-negative, thus guaranteeing that a 
local minimum has been achieved.⁷

BRCENS does not guarantee convergence to a global minimum of the 
CQR objective function. The contribution to the directional derivative be- 
comes zero at data points 2, for which the current estimate of the CQR 
line yields a strictly censored fitted value, i.e. 2/8 > yc; at the current $. 
Therefore, the simulation study considers a heuristically modified version 
of BRCENS, denoted by MBRCENS, which uses different starting values. 
MBRCENS starts with BRCENS. Based on the set of coefficients obtained, 
new starting values are used which tend to put the current CQR line out of 


⁶ In the Simplex setup, the set of nonbasic variables comprises those coefficients and
residuals (including the corresponding data points) which are currently zero and for
which the Simplex tableau provides the linear representation in terms of the variables
in the basis. The variables in the basis (coefficients, data points represented by the
corresponding residuals) are typically different from zero.

⁷ In contrast to the original version of BRCENS, the current version performs the final
directional derivative check for all combinations of k interpolated data points with rank
k as NB. This change improves the reported performance relative to earlier simulation
studies, cf. Fitzenberger (1994, 1997).

the censored region. The estimates from the first step are used as starting
values for BRCENS in the second step, except that the intercept coefficient
is shifted to the 25% quantile of the estimated residuals. The estimated
coefficients from the second step are again modified: the intercept
is now shifted to the 40% quantile of the estimated residuals and the slope
coefficients are multiplied by 0.8, yielding the starting values for a third
round with BRCENS. Among the three available estimates, MBRCENS finally
takes the one yielding the lowest value of the CQR objective function.
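
The restart scheme of MBRCENS can be sketched as follows. This is only an illustration under stated assumptions, not the Fortran code used in the study: `local_solver` is a hypothetical stand-in for BRCENS (any routine returning a local minimizer from given starting values), the first column of `X` is assumed to be the intercept, residuals are taken as $y_i - x_i'\beta$, and "shifting the intercept to a quantile of the residuals" is read as adding that empirical quantile to the intercept. The objective assumes the usual check-function form of the CQR distance for the model $y_i = \min(yc_i, x_i'\beta + \varepsilon_i)$.

```python
def check(u, theta):
    """Check function: theta*u for u >= 0 and (theta - 1)*u for u < 0."""
    return u * (theta - (1.0 if u < 0 else 0.0))


def cqr_objective(beta, X, y, yc, theta):
    """Assumed CQR distance: sum_i check(y_i - min(yc_i, x_i'beta))."""
    total = 0.0
    for xi, yi, ci in zip(X, y, yc):
        fit = sum(b * v for b, v in zip(beta, xi))
        total += check(yi - min(ci, fit), theta)
    return total


def quantile(values, q):
    """Empirical q-quantile by a simple order-statistic rule (an assumption)."""
    s = sorted(values)
    return s[min(len(s) - 1, max(0, int(q * len(s))))]


def shifted_start(beta, X, y, q, slope_scale=1.0):
    """New starting values: shift the intercept, optionally shrink the slopes."""
    resid = [yi - sum(b * v for b, v in zip(beta, xi)) for xi, yi in zip(X, y)]
    return [beta[0] + quantile(resid, q)] + [slope_scale * b for b in beta[1:]]


def mbrcens(local_solver, X, y, yc, theta, beta0):
    """Three rounds of the local solver; keep the estimate with the lowest objective."""
    b1 = local_solver(beta0, X, y, yc, theta)
    b2 = local_solver(shifted_start(b1, X, y, 0.25), X, y, yc, theta)
    b3 = local_solver(shifted_start(b2, X, y, 0.40, 0.8), X, y, yc, theta)
    return min((b1, b2, b3), key=lambda b: cqr_objective(b, X, y, yc, theta))
```

Because the first-round estimate is always among the three candidates, the value of the objective at the returned estimate can never exceed that of plain BRCENS; the extra cost is two further runs of the local solver.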


2.2 Iterative linear programming algorithm 


Buchinsky (1994, p. 412) suggests the following Iterative Linear Programming
Algorithm (ILPA), which has also been applied in Honoré and Powell
(1993). ILPA consists of successions of standard quantile regressions.
Starting with an initial coefficient estimate $\hat\beta_0$ and a counter $j = 1$, the
following iterative steps are continued until either convergence is achieved
or a maximal number of iterations is reached:


Step 1: For the $j$th iteration, determine the set $M_j$ of observations with
$x_i'\hat\beta_{j-1} < yc_i$. If $j = 1$ or $M_j \neq M_{j-1}$ then continue with Step 2, otherwise
terminate and take $\hat\beta_\theta = \hat\beta_{j-1}$ as the CQR estimate.


Step 2: Calculate $\hat\beta_j$ as the standard quantile regression estimate for the set
of observations $M_j$ by means of the BRA. Set $j := j + 1$ and repeat Step 1.
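
The two steps can be sketched for the one-regressor case as follows. This is a minimal illustration, not Buchinsky's or the author's implementation: the BRA is replaced here by a brute-force search over all lines through two observations (which contains a minimizer of the standard quantile regression objective when at least two regressor values differ), and the function names are hypothetical.

```python
from itertools import combinations


def check(u, theta):
    """Check function: theta*u for u >= 0 and (theta - 1)*u for u < 0."""
    return u * (theta - (1.0 if u < 0 else 0.0))


def quantreg_line(x, y, theta):
    """Standard quantile regression of y on (1, x) by enumerating the lines
    through two observations (brute-force stand-in for the BRA)."""
    best, best_val = None, float("inf")
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # no unique line through these two points
        slope = (y[j] - y[i]) / (x[j] - x[i])
        icpt = y[i] - slope * x[i]
        val = sum(check(yk - icpt - slope * xk, theta) for xk, yk in zip(x, y))
        if val < best_val:
            best, best_val = (icpt, slope), val
    return best


def ilpa(x, y, yc, theta, beta0=(0.0, 0.0), max_iter=20):
    """Return (estimate, converged) for the model y = min(yc, b1 + b2*x + e)."""
    beta, prev_set = beta0, None
    for _ in range(max_iter):
        # Step 1: observations whose fitted value lies strictly below yc_i
        current = frozenset(
            i for i in range(len(x)) if beta[0] + beta[1] * x[i] < yc[i]
        )
        if current == prev_set:
            return beta, True  # same subsample as before: terminate
        # Step 2: standard quantile regression on the selected subsample
        idx = sorted(current)
        new_beta = quantreg_line([x[i] for i in idx], [y[i] for i in idx], theta)
        if new_beta is None:
            return beta, False  # fewer than two distinct regressor values left
        beta, prev_set = new_beta, current
    return beta, False  # iteration limit reached without convergence
```

For example, with `x = [0, 1, 2, 3, 4]`, `y = [1, 3, 4, 4, 4]`, a common censoring point `yc = 4` and `theta = 0.5`, the iteration started at (0, 0) moves through the subsamples {0,...,4}, {0,1,2} and {0,1} and stops at the line 1 + 2x.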


Buchinsky states that ILPA is not guaranteed to converge. He motivates 
ILPA by the following two claims. First, for an optimal solution $\hat\beta_\theta$ of the
CQR problem, the set of observations for which the predicted value lies on
or above the censoring point, i.e. $x_i'\hat\beta_\theta \ge yc_i$, could have been excluded
from the estimation and one would still obtain the same estimate. And 
second, when convergence is reached, the coefficient estimate represents a 
local minimum of the CQR distance function. 

In Fitzenberger (1994), I provide counterexamples showing that there
exist both designs for which the CQR estimate interpolates a censoring
point and designs for which convergence of ILPA does not result in a local
minimum. The main issue is that the contribution of a single observation
to the directional derivative (2) changes when the regression line hits the 
censoring point depending on whether the observation itself is censored and 
depending on whether the regression line moves towards or away from the 
censored region. 

However, under the assumption that the exact CQR estimate does not
interpolate any censored observation, Buchinsky’s rationale holds, as follows
from results on standard quantile regressions. Therefore, ILPA has a lot of
appeal, since asymptotic consistency of the CQR estimate relies on the fact
that population quantiles can be estimated consistently as long as they are
uncensored, cf. Powell (1984, 1986).

There exists an alternative version of ILPA, which coincides with ILPA 
except that in Step 1 “<” is replaced by “≤” when defining the set $M_j$.⁸
This allows for the CQR estimates to interpolate censored observations, but 
again convergence does not guarantee a local minimum, cf. Fitzenberger 
(1994). Since earlier simulation studies showed that this modification does 
not lead to an improvement relative to the version of ILPA presented first, 
it is not considered further in this paper. 


2.3 NLRQ 


The algorithm NLRQ (“Nonlinear Regression Quantile”) developed in
Koenker and Park (1996) is a generic interior point algorithm for nonlinear
quantile regressions defined by minimizing the distance function
$\sum_i \mathrm{sgn}_\theta[y_i - f_i(x_i, \beta)]\,[y_i - f_i(x_i, \beta)]$ with respect to $\beta$. The algorithm
is built on $f_i(x_i, \beta)$ being differentiable in $\beta$ almost everywhere. NLRQ
considers successions of linearized quantile regression problems and at each
succession the algorithm performs two steps. First, it considers the dual
problem to obtain a one-dimensional search direction by interior point
methods. Second, the search direction from the dual problem is translated
into a search direction in the primal nonlinear problem and a conventional
one-dimensional line search is performed. After the two steps, the $f_i$'s and
their gradients are updated. The algorithm stops when the new iterate fails
to improve the objective function. Koenker and Park provide an S code,
cf. Becker et al. (1988), for NLRQ. For the subsequent simulation study, I
have translated their code into Fortran. This makes the timing comparison
somewhat unfair for NLRQ since I use grid search for the line search in the
second step.

Considering the use of NLRQ for CQR’s, one should note that the $f_i(\cdot)$
function is not everywhere differentiable in $\beta$. At points satisfying the
IP there could be observations for which the regression line interpolates
a censoring value involving a kink in $f_i(\cdot)$, see the expression for the directional
derivative in (2), i.e. $f_i(\cdot)$ is not differentiable here. Therefore, I
also consider a modification of NLRQ (denoted as MNLRQ) in the subsequent
simulation study, which takes account of the fact that the directional
derivative differs depending on the direction taken when the current CQR


⁸ In contrast to Buchinsky (1994, p. 412), Buchinsky (1997, section 8.1) refers to this
version as the ILPA.

interpolates a censoring point. In such a situation, MNLRQ tries all possible
permutations of the contribution of interpolated censoring points to
the directional derivative when forming the gradient until a search direction
is found along which the CQR objective function can be improved.⁹
MNLRQ stops when the CQR objective function cannot be improved further,
resulting in a local minimum of the CQR optimization problem. It has
to be emphasized that MNLRQ is not constructed to be a serious competitor
in terms of computation time. In fact, with a lot of censoring in the
data, my current implementation of MNLRQ performs very poorly in this 
respect, cf. Section 3. The goal here is rather to explore whether NLRQ 
could be improved by considering more precisely the directional derivative 
information at censoring points. 


3 Simulation results 


This section analyzes the performance of the algorithms described in Sec- 
tion 2 by means of a simulation study whose design is similar to Fitzen- 
berger (1994, 1997). Table 1 describes the data generating processes (DGP’s)
(A)-(H). For each scenario, 1000 random samples of size 100 are
drawn. The estimation problem is a censored LAD regression with one re- 
gressor and an intercept. A sample is dismissed if the exact CQR estimates 
(determined by the enumeration algorithm IPOL) are not unique. 


DGP   Censoring Values                        True Coefficients                       Regressor Values
(A)   $yc_i = \text{Const}$                   $(\beta_1, \beta_2) = (0, 0)$           $x_{i2} \sim N(0,1)$
(B)   $yc_i = \text{Const}$                   $(\beta_1, \beta_2) = (0, 0)$           $x_{i2} = -9.9 + 0.2 \cdot i$
(C)   $yc_i = \text{Const} + 0.5$             $(\beta_1, \beta_2) = (0.5, 0.5)$       $x_{i2} \sim N(0,1)$
(D)   $yc_i = \text{Const} + 0.5$             $(\beta_1, \beta_2) = (0.5, 0.5)$       $x_{i2} = -9.9 + 0.2 \cdot i$
(E)   $yc_i \sim N(\text{Const}, 1)$          $(\beta_1, \beta_2) = (0, 0)$           $x_{i2} \sim N(0,1)$
(F)   $yc_i \sim N(\text{Const}, 1)$          $(\beta_1, \beta_2) = (0, 0)$           $x_{i2} = -9.9 + 0.2 \cdot i$
(G)   $yc_i \sim N(\text{Const} + 0.5, 1)$    $(\beta_1, \beta_2) = (0.5, 0.5)$       $x_{i2} \sim N(0,1)$
(H)   $yc_i \sim N(\text{Const} + 0.5, 1)$    $(\beta_1, \beta_2) = (0.5, 0.5)$       $x_{i2} = -9.9 + 0.2 \cdot i$


a) Const denotes some constant taking various values, N(Const, 1) denotes the normal
distribution with mean Const and variance one, and I(.) denotes the indicator
function. The random variables $\varepsilon_i$ are distributed as i.i.d. N(0,1) and $i = 1, \ldots, N$.


Table 1: Data generating processes (DGP) (A)-(H) used in simulation
study for the model $y_i = \min(yc_i, \beta_1 + \beta_2 \cdot x_{i2} + \varepsilon_i)$ ᵃ⁾.
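
The designs in Table 1 can be simulated along the following lines. This is a sketch under assumptions rather than the original simulation code: the function name and the returned record layout are hypothetical, and an observation is counted as censored when the latent value $\beta_1 + \beta_2 x_{i2} + \varepsilon_i$ exceeds $yc_i$.

```python
import random


def simulate(dgp, const, n=100, rng=None):
    """Draw one sample from DGP 'A'..'H'; returns a list of (x, y, yc, censored)."""
    rng = rng or random.Random()
    random_x = dgp in "ACEG"    # regressor N(0,1) instead of a fixed sequence
    shifted = dgp in "CDGH"     # true coefficients (0.5, 0.5) and yc shifted by 0.5
    random_yc = dgp in "EFGH"   # observation-specific censoring points
    b = 0.5 if shifted else 0.0
    center = const + (0.5 if shifted else 0.0)
    sample = []
    for i in range(1, n + 1):
        x = rng.gauss(0.0, 1.0) if random_x else -9.9 + 0.2 * i
        yc = rng.gauss(center, 1.0) if random_yc else center
        latent = b + b * x + rng.gauss(0.0, 1.0)
        sample.append((x, min(yc, latent), yc, latent > yc))
    return sample


def censored_share(dgp, const, reps=200, seed=0):
    """Average share of censored observations over `reps` samples of size 100."""
    rng = random.Random(seed)
    flags = [obs[3] for _ in range(reps) for obs in simulate(dgp, const, rng=rng)]
    return sum(flags) / len(flags)
```

Up to simulation noise, `censored_share("A", 1.0)` and `censored_share("D", 1.0)` should come out near 0.16 and 0.41, which gives a quick plausibility check of the reconstructed design.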


⁹ In cases where there are too many interpolated censored observations (more than
$4 \times k$), each component of the current coefficient vector is perturbed by a very small
number. This is done to limit the number of permutations to be explored.

The DGP’s differ in four dimensions. First, by whether the coefficients
to generate the data are both 0 (A,B,E,F) or both 0.5 (C,D,G,H). Since 
all algorithms start with both coefficients at 0, it could make a difference 
whether the starting values are close to the truth. Second, the DGP’s differ 
by whether the censoring points are the same for all observations (A,B,C,D) 
or differ across observations (E,F,G,H). In the first case, if the CQR line is 
above the censoring value for a certain regressor value then this is true for all 
regressor values in a certain neighborhood. In the second case, the CQR line 
can be censored and uncensored for the same regressor values. This could 
have an influence on the directional derivative information used locally. 
Third, the DGP’s differ by whether the regressor is a random variable 
(A,C,E,G) or a fixed sequence of numbers (B,D,F,H). The two scenarios 
differ by whether all observations exhibit the same a priori distribution. 
And fourth, the DGP’s differ by the degree of censoring depending on
Const = 1, 0.5, 0. Table 2 shows the average share of censored observations
depending on the DGP and the choice of Const. Const = 0 represents a
situation where on average 50% of the observations are censored, i.e. the
exact CQR ($\theta = 0.5$) typically reaches the censored region.


DGP Const = 1.0 Const = 0.5 Const = 0.0 


(A) 15.9 30.9 49.7 
(B) 15.9 30.9 49.8 
(C) 18.7 32.9 49.9 
(D) 41.0 46.0 50.9 
(E) 24.1 36.4 50.1 
(F) 24.1 36.3 50.0 
(G) 25.4 37.1 50.0 
(H) 41.0 46.0 51.0 


Table 2: Average share of censored observations in random samples for
various data generating processes (DGP), in percent.


Convergence: Table 3 presents the absolute frequencies that an algorithm 
converges. For ILPA, the algorithm is terminated after 20 iterations, if no 
convergence is achieved. Increasing this number to 100 in some scenarios 
did not change the results. In such a case, the best coefficient vector during 
these 20 iterations (in terms of the CQR distance function) is taken as the 
ILPA estimate.¹⁰ MNLRQ and NLRQ are considered not to have converged


¹⁰ In an earlier simulation study, I used the final coefficient estimate after 20 iterations,

BRCENS, MBRCENS, and NLRQ converged under all DGP’s for all 1000 random samples.

DGP      Const = 1.0        Const = 0.5        Const = 0.0
         ILPA    MNLRQ      ILPA    MNLRQ      ILPA    MNLRQ

(A)       999     1000       925     1000       537      707
(B)      1000     1000       962     1000       508     1000
(C)       752     1000       494     1000       298      989
(D)       662     1000       656     1000       642     1000
(E)       750     1000       618     1000       485     1000
(F)       747     1000       641     1000       506     1000
(G)       727     1000       621     1000       526     1000
(H)       726     1000       720     1000       710     1000


Table 3: Absolute frequencies among 1000 random samples that algorithms
converged.


after 200 iterations. BRCENS, MBRCENS, and NLRQ converge for all 
scenarios. Convergence is a serious problem for ILPA, especially with a 
high degree of censoring (Const = 0), with bad starting values (C,D) or 
with random censoring points. When ILPA does not converge, it typically 
oscillates between two or three coefficient vectors.¹¹ Additional results
(not reported here) indicate that along the iterations ILPA reaches a local 
minimum as the best coefficient estimate in almost all cases for Const = 
1,0.5 and in at least 76% of all cases for Const = 0. MNLRQ converges in 
all cases with low or moderate censoring (Const = 1,0.5) but it exhibits 
some convergence problems in the presence of a lot of censoring (Const = 0, 
DGP (A) and (C)). Overall, lack of convergence is a very serious drawback 
for the application of ILPA. 


Optimality: Table 4 is concerned with the frequencies that the global 
minimum of the CQR distance function is achieved where the latter (the 
exact CQR estimate) is obtained by the enumeration algorithm IPOL. An 
algorithm is assumed to have achieved the optimum if the value of the objective
at the solution is within a tolerance of $10^{-7}$ of the value of the exact


cf. Fitzenberger (1994). 

¹¹ If ILPA does not converge, it must oscillate, since a finite sample allows only for
a finite number of subsamples which a standard quantile regression can be based upon.
Oscillation arises, since an observation, for which the current standard quantile regression
implies a censored fitted value, still contributes to the distance function. In the next
iteration, this observation is excluded from the sample, which can result in a new estimate
for which the fitted value at the aforementioned observation is now uncensored.

CQR estimates. The results are very favorable for BRCENS in comparison.
The relative performance of BRCENS is better, when the degree of censor- 
ing is higher and when there are more observation specific censoring points. 
However, for large degrees of censoring, all algorithms (including BRCENS) 
perform very poorly. Given that BRCENS, MBRCENS, and MNLRQ only 
guarantee convergence to a local minimum, this poor performance is to 
be expected, since the CQR distance function is already highly noncon- 
vex with moderate censoring. Unfortunately, a local minimum does not 
guarantee a solution close to the global minimum, see also the following 
results on the properties of the coefficient estimates. The modified ver- 
sion MBRCENS yields a substantial improvement compared to BRCENS. 
ILPA performs better than NLRQ. Again, the modified version MNLRQ
improves upon NLRQ and performs better than ILPA in many cases. All
algorithms perform quite satisfactorily with moderate degrees of censoring, 
a common censoring point, and good starting values, DGP (A) and (B) 
and Const = 1. 


Properties of Coefficient Estimates: The quality of the coefficient esti- 
mates obtained by various algorithms is an important issue, which has been 
mostly neglected in my previous simulation studies. Therefore, Tables 5 and
6 also provide results on the root-mean-squared deviation and the 90%
percentile of the absolute deviation of the respective estimates from the
exact CQR estimates for moderate degrees of censoring (Const = 0.5).¹²
According to the root-mean-squared deviation criterion, BRCENS and
ILPA exhibit almost the same performance and NLRQ performs noticeably
better. This is in contrast to the optimality results presented before.
Considering results for Const = 0 (not reported here) reverses the
relative performance of BRCENS and NLRQ (BRCENS also outperforms
ILPA in this case).¹³ The modified algorithm MBRCENS improves upon
BRCENS, whereas there is no clear ranking between MNLRQ and NLRQ.
MBRCENS performs slightly worse than MNLRQ and NLRQ. Turning to
the 90% percentiles, the numbers are considerably smaller than for the
root-mean-squared deviations; however, the relative performance of
the algorithms is only slightly changed (MBRCENS performs better than
NLRQ/MNLRQ in a considerable number of cases). Overall, these findings


¹² Further results for Const = 1,0.5 and results on the median and the 75% percentile
are available from the author upon request. 

¹³ At this point, it remains a topic for further research to investigate why NLRQ proves
less outlier sensitive than BRCENS or ILPA for moderate censoring. It seems to pay
off that interior point methods do not move too close to the constraint set (in linear
programming terminology) in an early stage of the iteration process. Some local minima
on the constraint set can actually imply quite extreme coefficient values.

indicate that for all algorithms the distribution of the absolute deviation
from the exact CQR estimates is very much skewed to the right. This effect 
is strongest for BRCENS and ILPA. 
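
The two summary measures reported in Tables 5 and 6 can be computed as in the following sketch. The function names and the order-statistic convention for the quantile are assumptions; the toy data are invented purely to illustrate the effect of a right-skewed distribution of deviations.

```python
import math


def rmsd(estimates, exact):
    """Root-mean-squared deviation of estimates from the exact CQR estimates."""
    return math.sqrt(
        sum((e - x) ** 2 for e, x in zip(estimates, exact)) / len(estimates)
    )


def abs_dev_quantile(estimates, exact, q=0.9):
    """q-quantile of the absolute deviations (order-statistic convention assumed)."""
    devs = sorted(abs(e - x) for e, x in zip(estimates, exact))
    return devs[max(0, math.ceil(q * len(devs)) - 1)]


# Toy illustration: 92 samples hit the exact estimate, 8 miss it by 1.0.
exact = [0.0] * 100
estimates = [0.0] * 92 + [1.0] * 8
print(rmsd(estimates, exact))              # about 0.283
print(abs_dev_quantile(estimates, exact))  # 0.0
```

A few large misses are enough to produce a sizeable root-mean-squared deviation while the 90% quantile stays at zero, which is the pattern of a distribution strongly skewed to the right.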


DGP ILPA BRCENS MBRCENS NLRQ MNLRQ 


Const = 1.0 
(A) 993 995 995 829 983 
(B) 1000 1000 1000 863 995 
(C) 655 864 908 445 705 
(D) 526 843 938 332 633 
(E) 655 866 904 387 499 
(F) 642 874 932 400 487 
(G) 630 855 913 229 454 
(H) 628 874 958 426 646 

Const = 0.5 
(A) 873 936 942 730 870 
(B) 924 942 958 782 934 
(C) 353 687 745 219 467 
(D) 526 850 932 361 586 
(E) 501 818 861 283 397 
(F) 511 802 876 310 393 
(G) 511 804 884 222 433 
(H) 599 837 936 451 639 

Const = 0.0 
(A) 90 330 379 2 ri 
(B) 81 174 408 53 149 
(C) 159 474 596 100 261 
(D) 487 806 911 358 592 
(E) 332 702 774 197 293 
(F) 374 720 845 189 296 
(G) 391 706 809 150 339 
(H) 603 826 942 465 646 


Table 4: Absolute frequencies among 1000 random samples that algorithms 
achieved global optimum of CQR distance function. 


Timing: Table 7 provides the relative average CPU time requirements 
of the different algorithms depending on the degree of censoring in the 
data (Const). Timing is of particular importance when bootstrapping the 
CQR estimates. The results show that BRCENS exhibits the lowest time 
requirements. ILPA and MBRCENS are next with no clear ranking between 
the two. In comparison, NLRQ and MNLRQ exhibit a much larger time
requirement. For a low degree of censoring, BRCENS is about 60 times
faster than the exact determination of the CQR estimates by means of
IPOL. This advantage is reduced to a factor of 10 when a lot of censoring is
present. In comparison, the incremental time requirement for MBRCENS is
fairly small, whereas NLRQ and MNLRQ are considerably more expensive.
For the case with a lot of censoring, MNLRQ even uses more time than
IPOL. However, for fairness' sake, it has to be mentioned that my Fortran
implementation of NLRQ and MNLRQ is likely to be somewhat inefficient.
Given the poor performance of all other algorithms in the presence of a lot
of censoring, IPOL appears a viable alternative in such a situation.


DGP    ILPA    BRCENS   MBRCENS   NLRQ    MNLRQ

Const = 0.5: estimates for $\beta_1$
(A)    .565    .066     .566      .475    .440
(B)    6.409   6.409    6.409     6.408   6.408
(C)    .181    .179     .175      .141    .146
(D)    .113    .108     .075      .075    .078
(E)    .045    .036     .033      .029    .027
(F)    .048    .043     .039      .034    .038
(G)    .051    .045     .037      .037    .030
(H)    .121    .114     .058      .073    .079

Const = 0.5: estimates for $\beta_2$
(A)    .309    .309     .309      .260    .247
(B)    .662    .662     .662      .662    .662
(C)    .162    .156     .155      .110    .122
(D)    .018    .018     .012      .012    .013
(E)    .051    .048     .046      .037    .036
(F)    .007    .007     .006      .005    .006
(G)    .050    .049     .041      .037    .030
(H)    .019    .018     .009      .013    .013


Table 5: Root-mean-squared deviation of coefficient estimates from exact
CQR estimates.


Summary and Recommendations: Summarizing the simulation re- 
sults, BRCENS performs quite well in comparison. It outperforms ILPA 
and NLRQ with respect to the frequencies that the exact CQR estimates are 
reached. BRCENS is very competitive in terms of the computation times 
involved and when trying out different starting values its performance can 
be improved considerably at a fairly small computational cost. All algorithms
perform the worse, the higher the degree of censoring. Based on the
simulation results reported here and recognizing that BRCENS guarantees
convergence to a local optimum, its application can be highly recommended
in situations with low or moderate degrees of censoring (relative to the
quantile $\theta$ being estimated). For high degrees of censoring, one might want
to determine the exact CQR estimates by means of IPOL. The results for
NLRQ show that its performance can be improved when using more precise
directional derivative information. However, this could result in considerably
higher computation cost (or lack of convergence, when only a small
number of iterations is allowed for).


DGP    ILPA    BRCENS   MBRCENS   NLRQ    MNLRQ

Const = 0.5: estimates for $\beta_1$
(A)    .000    .000     .000      .013    .000
(B)    .000    .000     .000      .011    .000
(C)    .276    .256     .243      .054    .030
(D)    .239    .221     .000      .040    .027
(E)    .108    .072     .046      .049    .051
(F)    .097    .084     .022      .057    .051
(G)    .099    .091     .030      .080    .057
(H)    .267    .213     .000      .050    .029

Const = 0.5: estimates for $\beta_2$
(A)    .000    .000     .000      .013    .000
(B)    .000    .000     .000      .002    .000
(C)    .324    .267     .259      .073    .045
(D)    .039    .034     .000      .014    .008
(E)    .112    .084     .050      .051    .050
(F)    .014    .017     .007      .010    .010
(G)    .096    .112     .032      .091    .055
(H)    .043    .036     .000      .012    .007


Table 6: 90% quantile of absolute deviation of coefficient estimates from
exact CQR estimates.


Const    IPOL     ILPA    BRCENS   MBRCENS   NLRQ     MNLRQ

1.0      66.24    1.77    1.00     1.69      24.50    21.90
0.5      64.19    2.12    2.05     2.69      27.25    24.19
0.0      61.96    6.86    6.41     7.27      25.50    700.87


a) The reported numbers are the ratios of average computation times across
DGP’s (A)-(H) for different degrees of censoring (Const = 1, 0.5, 0), cf. Table 2,
relative to BRCENS, Const = 1. The time results are obtained with
the UNIX ‘time’ command on an IBM RS 6000 workstation, based on a
Fortran implementation of the various algorithms.


Table 7: Average computation times ᵃ⁾.


Making the Laplacian Tortoise faster

Abstract: In “The Gaussian Hare and the Laplacian Tortoise”, the authors
present a two-pronged attack on the computation of $L_1$ and other
regression quantile estimators in linear models for large samples. The
first prong involves the application of interior point linear programming 
methods, specifically designed to treat the absolute error and related re- 
gression quantile objective functions. The second prong applies a form of 
stochastic preprocessing, somewhat reminiscent of the O(n) algorithms 
for computing the median of a single sample. These ideas provide computational
methods that are in theory faster than least squares as $n \to \infty$
(with probability tending to one), and in practice are faster than Splus 
least squares functions for n larger than 10* (and the number of param- 
eters moderate). Here some issues concerning this algorithm are consid- 
ered, and some improvements are proffered. 


Key words: Linear models, regression quantiles, $L_1$-estimation, computation.

1 Introduction


Consider the standard linear model: there are observations $\{Y_i : i = 1, \ldots, n\}$ satisfying

$$Y_i = x_i'\beta + u_i, \qquad i = 1, \ldots, n, \qquad (1)$$

where $x_i$ are vectors in $\mathbb{R}^p$, $\beta \in \mathbb{R}^p$ is a vector of unknown parameters,
and $\{u_i\}$ form an i.i.d. sequence of errors. Though we consider the model
conditionally on $\{x_i\}$, we will generally assume that these design vectors are
realizations of an independent random process. The traditional approach to
statistical analysis of (1) is to use least squares to estimate the conditional
mean of $Y$ given $x$: $E[Y \mid x] = x'\beta$. However, this provides an analysis
only of the center of the conditional distribution. To get a more complete
picture of the relationship between $Y$ and $x$, Koenker and Bassett (1978)
introduced regression quantiles as the solution to the problem: for each
$\tau \in [0, 1]$, let $\hat\beta(\tau)$ achieve

$$\min_{b \in \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau(Y_i - x_i'b) \qquad (2)$$

where $\rho_\tau(u) = \tau u^+ + (1 - \tau) u^-$. When $\tau = .5$, $\hat\beta(.5)$ is just the usual $L_1$
estimator, which corresponds to the conditional median estimator. Under
model (1) (and more generally), the line $y = x'\hat\beta(\tau)$ estimates the conditional
quantile of $Y$ given $x$. These methods have been applied successfully
in a wide variety of examples.
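
As a small illustration of the objective in (2), the following sketch evaluates $\rho_\tau$ and minimizes the objective in the simplest case of an intercept-only model ($p = 1$, $x_i = 1$), where a minimizer can always be found among the observations and is a sample $\tau$-quantile. The function names are hypothetical and the search over data points is only meant for tiny examples.

```python
def rho(u, tau):
    """rho_tau(u) = tau * u^+ + (1 - tau) * u^-."""
    return tau * max(u, 0.0) + (1.0 - tau) * max(-u, 0.0)


def location_quantile(ys, tau):
    """Minimize sum_i rho_tau(Y_i - b) over b, searching the data points only."""
    return min(ys, key=lambda b: sum(rho(y - b, tau) for y in ys))


ys = [float(v) for v in range(1, 12)]   # 1, 2, ..., 11
print(location_quantile(ys, 0.5))       # 6.0, the sample median
print(location_quantile(ys, 0.25))      # 3.0, the third order statistic
```

With regressors present the same objective is minimized over $b \in \mathbb{R}^p$, and the role of the order statistics is taken over by fits that pass through $p$ observations.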

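Computation of regression quantile estimators has depended on the
recognition that the minimization problem (2) is equivalent to a linear
program, viz.:

$$\min \sum_{i=1}^{n} \left( \tau u_i + (1 - \tau) v_i \right) \qquad (3)$$

subject to

$$Y_i = x_i'b + u_i - v_i, \quad u_i \ge 0, \quad v_i \ge 0, \qquad i = 1, \ldots, n.$$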


This equivalence was first presented in the literature in the mid 1950’s. 
However, it is not unlikely that it was known earlier but dismissed as hav- 
ing little or no computational implication until the power of the simplex 
algorithm was appreciated. In fact, in the early 19th century Gauss (1809)
already recognized that the $L_1$ estimator could be characterized as having
$p$ zero residuals. As described in Portnoy and Koenker (1997), efficient
algorithms based on Dantzig’s simplex algorithm were developed, and they
proved quite effective for sample sizes $n < 1000$ (or so). However, these
algorithms were extremely slow for sample sizes significantly larger than
1000. Thus, Portnoy and Koenker (1997) were led to develop a new computational
approach based on two fundamental ideas.

The first idea involved replacing simplex approaches with interior point 
methods, which originated in the mid 1980’s and have been under extremely 
active development since then. The traditional simplex method is based on 
the idea that the constraint set for a linear programming problem is a 
simplex: that is, a convex set defined as the set of convex combinations 
of a finite number of extreme points (viz., the vertices of the constraint
set). In the regression quantile problem, the vertices are just the $\beta$-values
defined as having $p$ zero residuals (which are often called “elemental” solutions).
One proceeds iteratively by evaluating the linear objective function
at a vertex and finding the adjacent vertex in the direction of steepest descent
of the objective function. The solution is found after a finite number
of such steps (called “pivots”) when the algorithm reaches a point where
no descent direction remains. Unfortunately, in large sample regression
quantile problems, the number of vertices is of order $n^p$, and this tends
to require the algorithm to pass through a very large number of vertices
before reaching the solution. Fortunately, under moderate distributional
assumptions in model (1), it is possible to show that with probability tending
to one, each pivot moves a fraction $1/n$ towards the solution. Thus,
since each pivot takes $O(np)$ operations, the simplex algorithm might be
expected to take $O_p(n^2 p^2)$ operations (much larger for large $n$ than the
$O(np^2)$ operations required for least squares algorithms). In fact, this rate
should be reducible to $O_p(n^{3/2} p^2)$, since one can generally obtain an initial
estimate within $O(n^{-1/2})$ of the solution (e.g., the least squares estimate
will work as an initial estimate of the median for symmetric distributions).
Nonetheless, this is substantially longer than least squares, especially since
the constants implicit in the big-O terms are larger for simplex pivoting
than for least squares methods.

Modern interior point methods escape this problem by avoiding the
boundary of the constraint set. Consider the canonical linear program

$$\min \{ c'x \mid Ax = b,\ x \ge 0 \} \qquad (4)$$

(where $c$ is a vector, $A$ is a matrix, and the inequalities are taken coordinatewise).
Associate with this problem the following logarithmic barrier reformulation,
which severely penalizes points close to the boundary:

$$\min \{ B(x, \mu) \mid Ax = b \} \qquad (5)$$

where

$$B(x, \mu) = c'x - \mu \sum_k \log x_k .$$

In effect, (5) replaces the inequality constraints in (4) by the penalty term
of the log barrier. Solving (5) with a sequence of parameters $\mu$ such that
$\mu \to 0$, we obtain in the limit a solution to the original problem (4). For
each $\mu$, problem (5) can be solved relatively effectively by iterative Newton
or quasi-Newton methods: at each trial solution, one approximates the
problem locally by a quadratic minimization problem and moves as far as
possible towards the solution of the approximating problem without leaving
the constraint set (that is, remaining in the interior). The historical context
of these methods and the details of our use of this approach for regression
quantiles are presented in Portnoy and Koenker (1997). It appears that such
methods can be as reliable as simplex methods, and are substantially faster
for large $n$.
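
The effect of letting $\mu \to 0$ in (5) can be seen in a toy problem. The sketch below is not the Newton-type method described above: it assumes the special case $A = (1, \ldots, 1)$, $b = 1$, for which the stationarity condition $c_k - \mu / x_k = \lambda$ gives $x_k = \mu / (c_k - \lambda)$ in closed form, and it finds the multiplier $\lambda$ by bisection. The function name and the numbers are illustrative only.

```python
def barrier_solution(c, mu, iters=200):
    """Minimize c'x - mu * sum(log x_k) subject to sum(x) = 1 (toy case of (5)).

    Stationarity gives x_k = mu / (c_k - lam) with lam < min(c); lam is chosen
    by bisection so that the x_k sum to one.
    """
    n, cmin = len(c), min(c)
    lo, hi = cmin - n * mu, cmin - mu   # sum(x) <= 1 at lo and >= 1 at hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(mu / (ck - mid) for ck in c) < 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [mu / (ck - lam) for ck in c]


c = [3.0, 1.0, 2.0]
for mu in (1.0, 0.1, 0.001, 1e-6):
    print(mu, [round(v, 4) for v in barrier_solution(c, mu)])
```

For every $\mu > 0$ the solution lies strictly inside the constraint set, and as $\mu$ decreases it approaches the vertex $(0, 1, 0)$ that solves the corresponding problem (4).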

Nonetheless, the interior point algorithms were still much slower than 
least squares for very large n. In fact, the best rates for complexity avail- 
able from the interior point literature are of the order O,(n>/4p*) computer 
operations for random problems, though it has been conjectured that the 
n°/4 factor can be reduced to n logn. The second approach to accelerating 
the computation of regression quantiles is based on a stochastic preprocess- 
ing step that begins with a much smaller random subset of the data. To 
describe this approach, consider the $L_1$ problem:

$$\min_b \sum_{i=1}^{n} |Y_i - x_i'b| . \qquad (6)$$

Suppose for the moment that we “knew” that a certain subset $J_H$ of the
observations $N = \{1, \ldots, n\}$ would fall above the optimal median plane
and another subset $J_L$ would fall below. Then, since knowing the sign of a
residual permits the replacement of the absolute value by the appropriate
sign,

$$\sum_{i=1}^{n} |Y_i - x_i'b| = \sum_{i \in N \setminus (J_L \cup J_H)} |Y_i - x_i'b| - \sum_{i \in J_L} (Y_i - x_i'b) + \sum_{i \in J_H} (Y_i - x_i'b).$$

It follows that

$$\sum_{i=1}^{n} |Y_i - x_i'b| = \sum_{i \in N \setminus (J_L \cup J_H)} |Y_i - x_i'b| + |Y_L - x_L'b| + |Y_H - x_H'b| \qquad (7)$$

where $x_K = \sum_{i \in J_K} x_i$ and $Y_K = \sum_{i \in J_K} Y_i$ for $K \in \{H, L\}$. We will refer
to these combined “pseudo-observations” as “globs” in what follows. It is
not hard to show that minimizing (7), under our provisional hypothesis on
the signs of the residuals, yields exactly the same solution as (6), but the
revision has reduced effective sample size by $\#\{J_L \cup J_H\} - 2$ (essentially
by the number of observations in the globs).
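
Identity (7) is easy to check numerically. The sketch below builds the two globs for a small invented data set and compares the $L_1$ objective of the full sample with that of the globbed sample at two values of $b$ for which the assumed residual signs hold. The helper names are hypothetical.

```python
def l1_objective(b, X, Y):
    """Sum of absolute residuals |Y_i - x_i'b|."""
    return sum(abs(y - sum(bj * xj for bj, xj in zip(b, x))) for x, y in zip(X, Y))


def globbed_sample(X, Y, J_L, J_H):
    """Replace the observations in J_L and J_H by one summed pseudo-observation each."""
    keep = [i for i in range(len(Y)) if i not in J_L and i not in J_H]
    Xg, Yg = [X[i] for i in keep], [Y[i] for i in keep]
    for J in (J_L, J_H):
        if J:
            Xg.append([sum(X[i][j] for i in J) for j in range(len(X[0]))])
            Yg.append(sum(Y[i] for i in J))
    return Xg, Yg


X = [[1.0, float(k)] for k in range(6)]
Y = [5.0, -4.0, 2.1, 8.0, -1.0, 5.2]
J_H, J_L = {0, 3}, {1, 4}          # clearly above / below the plane y = x
Xg, Yg = globbed_sample(X, Y, J_L, J_H)
for b in ([0.0, 1.0], [0.5, 0.9]):
    print(l1_objective(b, X, Y), l1_objective(b, Xg, Yg))
```

The two objectives agree as long as every residual in $J_H$ is positive and every residual in $J_L$ is negative, while the globbed sample has only four rows instead of six.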

To find $J_L$ and $J_H$, consider computing a preliminary estimate $\hat\beta$ based
on a subsample of $m$ observations. Compute a simultaneous confidence
band for $x_i'\beta$ based on this estimate for each $i \in N$. Under plausible sampling
assumptions the length of each interval is proportional to $1/\sqrt{m}$, so if
$M$ denotes the number of $Y_i$ falling inside the band, $M = O_p(n/\sqrt{m})$.
Take $J_L$, $J_H$ to be composed of the indices of the observations falling
outside the band. So we may now create the “globbed” observations
$(Y_K, x_K)$, $K \in \{L, H\}$, and reestimate based on $M + 2$ observations. Finally,
we must check to verify that, in fact, all the observations in $J_L$, $J_H$
have the predicted residual signs. If so, we are done; if not, we must repeat
the process. If the coverage probability of the bands is $P$, presumably near
1, then the expected number of repetitions of this process is the expectation
of a geometric random variable, $Z$, with expectation $P^{-1}$. We will call
each repetition a cycle. As described in Portnoy and Koenker (1997), it
is possible to show that the optimal choice for the initial subsample size
is $m = O(n^{2/3})$, and that the resulting size of the globbed sample is then
$M = O_p(n^{2/3})$. Under modest distributional assumptions, this provides an
algorithm with complexity Op(n?/5(log n)?p?) + O(np), where the last term 
comes from the computation of the globs and the checking of residuals. 
For $n$ large and $p$ moderate, this is strictly better than the rate $O(np^2)$
for least squares; and in fact the algorithm in Portnoy and Koenker (1997) 
is essentially as fast as Splus least squares algorithms for n < 10° and p 
moderate; and it is undoubtedly strictly faster for n much larger than this 
range. 

The algorithm presented in Portnoy and Koenker (1997) can be de- 
scribed formally, but briefly, as follows: 


k ← 0
l ← 0
m ← [2n^{2/3}]
while (k is small) {
    k = k + 1
    solve for initial rq using first m observations
    compute confidence interval for this solution
    reorder globbed sample as first M observations
    while (l is small) {
        l = l + 1
        solve for new rq using the globbed sample
        check residual signs of globbed observations
        if no bad signs: return optimal solution
        if only few bad: adjust globs, reorder sample, update M, continue
        if too many bad: increase m and break to outer loop
    }
}


Here, “rq” is the regression quantile problem, and an interior point 
method is used to “solve” this problem. The remainder of the paper de- 
scribes several approaches to improving this algorithm and to obtaining a 
better understanding of its performance. 


2 Some simple improvements 


Three relatively simple and straightforward improvements in the algorithm 
above have been made and tested. The first concerns the choice of the initial 
random subsample. The algorithm above assumes that the sample comes 
already randomized. For samples that are not simulated, this requires an 
initial random permutation of all the data — a rather time-consuming task 
if n is very large. Note that the above algorithm does not permit simply 
taking an initial random subsample of size m, since m increases with each 
cycle in order to ensure termination of the iterations. One could take a ran- 
dom subsample of size m(k) (where k is the cycle number) at the beginning 
of each cycle, but this is still rather time-consuming. A faster approach is 
to form the “random” subsample by taking a random observation from 
each consecutive n/m observations. Although this is slightly different from 
taking a fully random permutation, it appears to be sufficiently random 
in all cases checked so far. It is extremely quick, requiring only m ran- 
dom uniforms and no sorting. It has the added advantage for real data of 
avoiding any large gaps in the sampling (as can occur with fully random 
permutations). The current implementation uses a simple multiplicative 
congruential generator. It appears to provide a modest improvement in 
timings at the cost only of keeping track of an additional random seed. 
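
The block-wise subsampling described above can be sketched as follows. This is only an illustration: the function name `block_subsample` and its `seed` argument are hypothetical, and Python's built-in generator stands in for the multiplicative congruential generator mentioned in the text.

```python
import random


def block_subsample(n, m, seed=0):
    """Return m indices, one drawn uniformly from each consecutive block.

    Block k covers indices [k*n//m, (k+1)*n//m), so only m uniform draws
    are needed and the result is already in increasing order (no sorting).
    """
    if not 0 < m <= n:
        raise ValueError("need 0 < m <= n")
    # Assumption: random.Random replaces the congruential generator of the text.
    rng = random.Random(seed)
    picks = []
    for k in range(m):
        start = (k * n) // m        # first index of block k
        stop = ((k + 1) * n) // m   # one past the last index of block k
        picks.append(rng.randrange(start, stop))
    return picks
```

Because every block contributes exactly one index, no stretch of the data longer than two blocks can be left unsampled, which is the "no large gaps" property noted in the text.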

The second relatively simple improvement involves choice of the simul- 
taneous confidence bands used to determine the residual signs for the first 
subsample. The algorithm above was originally programmed using the tra- 
ditional Scheffé bands of the form 


$$x_i'\hat{\beta} \pm \left(c\, x_i'(X'X)^{-1} x_i\right)^{1/2} \hat{\sigma},$$


where $c$ is a constant (from F-tables) and $\hat{\sigma}^2$ is an estimate of 
$\tau(1 - \tau)/f^2(F^{-1}(\tau))$. Unfortunately, these bands require $np^2$ operations, a value 
that can make the algorithm very time-consuming and, in fact, does not 
even attain the complexity rate claimed above. To get a faster approach, 
note that it is preferable to choose the constant c to optimize the speed 
of the algorithm rather than to attain a given coverage probability. Thus, 
different conservative alternatives might prove better. One method that 
seems to work well is based on the inequality, 
$$|x_i'\hat{\beta}| \le \max_j \{|\hat{\beta}_j|/s_j\} \times \sum_{j=1}^{p} |x_{ij}|\, s_j,$$


where $s_j$ is $\hat{\sigma}$ times the $j$th diagonal element of the $(X'X)^{-1}$ matrix, and 
$\hat{\sigma}$ is computed as for the Scheffé intervals. This approach provides 
conservative (though not “exact”) confidence bands with width $c_g \sum_j |x_{ij}|\, s_j$. 
Note that this requires only $O(np)$ operations, thus providing the rate 
required for the result of Section 1. Choice of the constant, $c_g$, is somewhat 
problematic, but some experimentation with simulated data showed that 
$c_g$ could be taken conservatively to be approximately one, and that the 
algorithm was remarkably independent of the precise value of $c_g$. Although 
our initial experience with this approach is extremely promising, the next 
improvement discussed below permits the selection of $c_g$ to be replaced by 
a simpler selection of an alternative parameter. 
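
To make the operation count concrete, here is a minimal sketch of the conservative half-widths, assuming the design matrix is given as a list of rows and the per-coefficient scales are already available. The names `conservative_halfwidths` and `bound_holds` are illustrative, not taken from the original implementation.

```python
def conservative_halfwidths(X, s, c=1.0):
    """Half-width c * sum_j |x_ij| * s_j for every row x_i of X.

    Each row costs p multiplications, so the whole pass is O(n p).
    """
    return [c * sum(abs(xij) * sj for xij, sj in zip(row, s)) for row in X]


def bound_holds(row, beta, s):
    """Check |x'beta| <= max_j(|beta_j| / s_j) * sum_j |x_j| * s_j for one row."""
    lhs = abs(sum(x * b for x, b in zip(row, beta)))
    ratio = max(abs(b) / sj for b, sj in zip(beta, s))
    rhs = ratio * sum(abs(x) * sj for x, sj in zip(row, s))
    return lhs <= rhs + 1e-12  # small slack for floating-point rounding
```

The second function only verifies the inequality numerically for a given vector; it holds for any vector because each term $|x_{ij}\beta_j|$ equals $|x_{ij}| s_j \cdot |\beta_j|/s_j$.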

A third improvement concerns the size of the globbed sample. If the 
constant, c, in the simultaneous confidence intervals is fixed, the size of the 
globbed sample, M, is random. Since it is optimal to have M approximately 
equal to m, where m is the initial sample size, we have tried to choose the c 
so that M is near m with high probability (this is possible under model (1) 
as $n \to \infty$). In practice, $M$ varies significantly, and this leads to difficulties. 
If M is too small, the final estimates are likely to be wrong, thus requiring 
extra cycles. If M is too large, the interior point algorithm takes much 
more time than it should. Since in the limit, M will be very near its mean, 
which is a constant times m, it seems reasonable to fix the sample size 
M = am, where a is a constant near 1. The globbed sample consists of the 
$M = am$ observations with the smallest values of $|r_i|/z_i$, where $r_i$ is the 
residual and $z_i$ is the constant in the confidence intervals depending only 
on the design matrix. That is, $z_i = (x_i'(X'X)^{-1}x_i)^{1/2}$ for Scheffé intervals, 
and $z_i = \sum_j |x_{ij}|\, s_j$ for the conservative intervals described above. This 
is asymptotically equivalent to fixing c, but always provides an appropriate 
globbed sample size. In a small scale simulation study, it appeared that 
a = .8 was a good choice over a variety of data set sizes and distribution 
assumptions. The lack of randomness in M provides a much more reliable 
and faster algorithm, whose performance varies substantially less from trial 
to trial. The result that a < 1 probably arises from the fact that the globbed 
sample is not like a random sample (as discussed in Section 4). Thus the 
interior point algorithm takes somewhat more time on a globbed sample 
than on a random subsample of the same size (perhaps 20 to 40 per cent 
more time, depending on the specific problem). 
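
The fixed-size selection can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: each glob is formed by summing the discarded observations on one side of the fit, and a full sort is used for brevity where a linear-time selection would do. The function name `build_globbed_sample` is hypothetical.

```python
def build_globbed_sample(X, y, resid, z, M):
    """Keep the M observations with smallest |r_i| / z_i; glob the rest by sign.

    Returns (X_glob, y_glob, kept). The last two rows of the globbed sample
    are the low glob (negative residuals) and the high glob (positive
    residuals), each formed here as a sum (an assumption of this sketch).
    """
    n, p = len(y), len(X[0])
    order = sorted(range(n), key=lambda i: abs(resid[i]) / z[i])
    kept = sorted(order[:M])
    glob_x = {"L": [0.0] * p, "H": [0.0] * p}
    glob_y = {"L": 0.0, "H": 0.0}
    for i in order[M:]:
        side = "H" if resid[i] > 0 else "L"
        for j in range(p):
            glob_x[side][j] += X[i][j]
        glob_y[side] += y[i]
    X_glob = [list(X[i]) for i in kept] + [glob_x["L"], glob_x["H"]]
    y_glob = [y[i] for i in kept] + [glob_y["L"], glob_y["H"]]
    return X_glob, y_glob, kept
```

The returned sample always has exactly M + 2 rows, which is what removes the randomness in the size of the second-stage problem.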


3 On the interior point algorithm 


For n < 10°, the algorithm of Section 1 spends almost all of its time in the 
interior point optimization steps for the random and globbed subsamples. 
However, the full advantage of the preprocessing step isn’t realized until the 
operations of complexity O(np) dominate (that is, until most of the time is 
spent in creating the globs and in checking residuals). Thus, improvements 
in the interior point algorithm should provide the most important sources 
of faster performance (for n < 10°). Furthermore, the use of nonlinear op- 
timization methods requires the specification of a number of performance 
parameters. These include: (i) choice of starting values for the variables 
over which optimization is carried out, (ii) choice of direction at each step 
(for example, use of gradient direction for steepest descent, or use of the 
Newton direction to solve the approximating quadratic, or some combina- 
tion of these), (iii) specification of the distance to move along the chosen 
descent direction, (iv) selection of a method for updating the penalty coefficient 
$\mu$ in equation (5), and (v) selection of stopping criteria. Specification 
of these parameters provides ample room for fine-tuning the algorithm to 
provide better performance. 

Interior point methods for $L_1$ problems generally replace the linear 
program in equation (3) by a related problem, called the dual problem 


$$\max\{Y'a \mid X'a = \tfrac{1}{2}X'e,\ a \in [0,1]^n\}, \qquad (8)$$


where $Y$ is the vector of response observations and $X$ is the design matrix. 
Here the coordinates of $a$, $\{a_i\}$, correspond to the signs of the residuals, 
$r_i = Y_i - x_i'b^*$, at the final solution, $a^*$, where $b^*$ is the optimal solution to 
equation (3). Precisely, at a solution, $a_i = 1$ if $r_i > 0$, $a_i = 0$ if $r_i < 0$, and 
$a_i$ is strictly between 0 and 1 if $r_i = 0$. In the regression quantile setting, 
the values $a_i(\tau)$ $(0 < \tau < 1)$ are exact analogues of the rank functions of 
Hájek and Šidák (1967): see Gutenbrunner and Jurečková (1992). Thus, 
the solution b* to the primal problem (3) can be determined from the 
solution to (8); and, in fact, the objective functions are the same at the 
solutions. The “primal-dual” algorithm of Portnoy and Koenker (1997) 
proceeds by establishing first order conditions depending on both a and 
b, and solving for both simultaneously. That is, in each iteration, both 
a and b are moved in a Newton direction toward the solution. At each 
iteration, the difference between the primal and dual objective functions 
is positive until the solution is reached. At the solution, this difference, 
called the “duality gap” becomes zero, and this provides a reliable test for 
convergence of the algorithm. Although each iteration of the primal-dual 
algorithm is somewhat more complicated, the number of iterations tends 
to be somewhat smaller than for other approaches, and the algorithm appears 
to be quite robust to idiosyncrasies in the data. 

The algorithm in Portnoy and Koenker (1997) chooses the initial starting 
values as follows: $b_0$ is chosen to be the least squares estimator, and $a_0$ is 
chosen to be a constant vector with all $n$ elements equal to .5 (halfway 
between the bounds 0 and 1). In an effort to see how this algorithm works, 
the values of the dual variables, $a_i$, were plotted against $i$ at each iteration. 
Plots were made for several random data sets with sample size 1000 or 
2000, values near typical initial sample sizes for problems with n between 
$10^4$ and 10°. One picture of successive iterations on normal data with 
p = 6 appears in Figure 1. The first plot gives the a-values after one 
iteration. Note that the values have separated from the initial constant .5 
and have begun to move toward 0 and 1. Remarkably, the vast majority of 
the values move exactly the same fraction of the distance to the extreme 
limits, 0 and 1. The values quickly approach the limiting values, except 
for the observations with zero residuals at the solution. The last iteration 
represents a very minor fine tuning of the next to last one, which already 
identified the zero residuals. This suggests the possibility of stopping a bit 
earlier, but in practice early stopping provides only a modest improvement 
in timings (at the cost of potentially less reliable performance). 

Similar plots were made when the primal-dual algorithm was applied to 
the globbed sample. Here the $a_i$-values were plotted against $|r_i|/z_i$, where 
$r_i$ are the residuals and the $z_i$-values are defined in the third improvement 
discussed in Section 2. That is, the $a_i$ are plotted against the order at 
which observations enter the globbed sample. A picture for Cauchy data 
with $p = 3$ is given in Figure 2. At first glance, the results appear even more 
remarkable. With the residual ordering taken into account, it is clear that 
the $a_i$-values for smaller residuals tend to their limits much more slowly 
than those for larger residuals (among the globbed sample, for which all 
but the last two observations, the globs, have small residuals). Again, 
the $a_i$-values tend to fall along lines, but the lines are not symmetric about 
.5, and they become highly curved after iteration 6. 

It is possible that the use of the least squares estimator as the initial 
starting value for applying the interior point algorithm to the globbed sam- 
ple might be responsible for some of the oddities in the plots. An alternative 
that should be somewhat better and might be somewhat faster is to use the 
$\hat{\beta}$ from the solution to the initial random subsample as the starting value 
for the globbed sample. It turns out that this does not affect the plots 
of the $a_i$-values for the specific example plotted in Figure 2. However, in 
various simulation experiments, use of the initial $\hat{\beta}$ as the starting value 
for applying the interior point algorithm to the globbed samples appeared 
to provide a modest but definite improvement. These figures give some 
tantalizing hints as to why this is so, but significantly better understand- 
ing of the performance of interior point methods should provide even more 
substantial improvements. 


4 On applying the algorithm recursively 


An obvious source of improved performance for very large samples would 
clearly be to apply the algorithm recursively. Since the algorithm gives the 
fastest computation of regression quantiles, the use of interior point methods 
should be replaced by the full algorithm using preprocessing whenever 
the subsample sizes are sufficiently large to make this replacement notice- 
ably faster. Generally, this would occur when n is somewhat larger than 
10°. Unfortunately, there is a serious problem with this replacement for 
solving the globbed problem. The stochastic preprocessing step assumes 
that the sample is a random one. The globbed sample is far from random 
— it is chosen to consist of the smaller residuals from the initial sample plus 
the two globs. Thus, the preprocessing step cannot be expected to provide 
any real reduction in sample size for the globbed sample. Figure 3 should 
make this clear. The first graph in Figure 3 plots a Normal sample of size 
10,000 together with the whole-sample $L_1$ line, the initial subsample 
$L_1$ line and the confidence bands (based on the subsample). The solution 
to the initial subsample should differ from the correct sample regression 
quantile by an error of order $O_p(m^{-1/2})$ (where $m$ is the initial subsample 
size). This error is of the same order as the width of the confidence bands. 
Therefore, as the first graph shows, the lines and confidence bands are ex- 
tremely close together on the scale of the data. The second plot of Figure 3 
shows just the globbed sample together with the correct $L_1$ line and an $L_1$ 
line based on a random subsample of the globbed sample. Clearly, any con- 
fidence band about this subsample line that contains the correct line must 
contain all but a relatively modest fraction of the globbed sample. That 
is, since the residuals should be roughly uniformly distributed between the 
bands (assuming the random errors have a smooth density near the desired 
quantile), it is clear that an appreciable fraction of them must have their 
residuals from a subsample estimate differ in sign from those of the correct 
estimate. Thus, for the globbed sample, it would be impossible to replace 
the problem by one with a sample size smaller than a constant fraction of 
the globbed sample size. This contrasts markedly with the reduction from 
$n$ to $n^{2/3}$ that preprocessing affords for random samples. 

It is possible to use the preprocessed algorithm for the initial random 
sample. A few simulations with n > 10° and p < 4 were tried with this 
modification, and a modest improvement (about 20%) was obtained. Un- 
fortunately, computer space limitations precluded more extensive testing, 
which remains to be done. 


Recent developments in PROGRESS 


Peter J. Rousseeuw and Mia Hubert 


University of Antwerp, Belgium 


Abstract: The least median of squares (LMS) regression method is highly 
robust to outliers in the data. It can be computed by means of PROGRESS 
(from Program for RObust reGRESSion). After ten years we have de- 
veloped a new version of PROGRESS, which also computes the least 
trimmed squares (LTS) method. We will discuss the various new fea- 
tures of PROGRESS, with emphasis on the algorithmic aspects. 


Key words: Algorithm, breakdown value, least median of squares, least 
trimmed squares, robust regression. 


AMS subject classification: 62F35, 62J05. 


1 Introduction 


At the time when the least median of squares (LMS) regression method 
was introduced (Rousseeuw, 1984), a program was needed to compute it in 
practice. The first algorithm described in that paper was just for computing 
the LMS line in simple regression, based on scanning over possible slopes 
while adjusting the intercept each time. 

However, it was clear from the start that an algorithm for LMS multiple 
regression was required. The first version of PROGRESS (from Program 
for RObust reGRESSion) was implemented in 1983. The 1984 paper al- 
ready contained an example analyzed with PROGRESS and listed the pro- 
gram’s computation times on a CDC 750, one of the fastest mainframes of 
that day but outperformed by today’s PC’s. During the next years, when 
people began requesting the program, it was made more user-friendly with 
interactive input and self-explanatory output. The use of the program was 
explained in detail in (Rousseeuw and Leroy, 1987). Because that book 
contained many sample outputs we refrained from making any substantial 
modifications to PROGRESS, which remained essentially unchanged from 
1985 until 1995. 

During that decade there were quite a few suggestions for modifications 
and extensions. For instance, several people asked for the inclusion of the 
least trimmed squares (LTS) method which had been proposed together 
with LMS in (Rousseeuw, 1984) but which was not built in from the start 
because it needed (a little) more computation time, which became less rel- 
evant with the increasing speed of hardware. Another idea was to improve 
the accuracy by carrying out intercept adjustments more often, and we 
also wanted to allow the user to replace the ‘median’ in LMS by another 
quantile. Therefore we finally gave up on the principle of keeping the out- 
puts identical to those in the 1987 book, and created the modernized 1996 
version of PROGRESS described in the present paper. 

First of all, PROGRESS now allows the user to choose between two 
robust estimators: the least quantile of squares (LQS) method which gen- 
eralizes LMS, and the least trimmed squares (LTS) method. By definition, 
these methods depend on a quantile h/n. In order to help the user make an 
appropriate choice of h, the program provides a range of h-values for which 
LQS and LTS have a breakdown value between 25% and 50%. (This means 
that the method can resist that many contaminated observations.) Sec- 
tion 2 describes the LQS and LTS, and Section 3 obtains their breakdown 
value which depends on h. 

Section 4 provides an outline of the algorithm used for the LQS and LTS. 
Since their objective functions are difficult to minimize exactly, PROGRESS 
performs an approximate resampling algorithm. Whereas the 1985 version 
adjusted the intercept only once at the end, the intercept is now adjusted 
in each step. This yields a lower objective function value, and in simple 
regression we even find the exact minimum. The program now allows the 
user to search over all subsets, as well as over a user-defined number of random 
subsets. 

In Section 5 we define a new version of the robust coefficient of determination 
($R^2$) to make sure that it always takes on values in the interval 
[0,1]. Finally, Section 6 discusses the robust diagnostics which the program 
provides to identify outliers and leverage points. Section 7 explains how 
the program can be obtained. 


2 The estimators LQS and LTS 


We consider the linear multiple regression model 

$$y_i = x_{i1}\theta_1 + x_{i2}\theta_2 + \dots + x_{ip}\theta_p + \sigma e_i = x_i^t\theta + \sigma e_i \qquad (1)$$


for $i = 1,\dots,n$. The $p$-dimensional vectors $x_i$ contain the explanatory 
variables, $y_i$ is the response and $\sigma e_i$ is the error term. The data set thus 
consists of $n$ observations and will be denoted by $Z = (X,y)$. For a given 
parameter estimate $\hat{\theta}$ we denote the residuals as $r_i(\hat{\theta}) = y_i - x_i^t\hat{\theta}$. In a 
regression model with intercept, the observations satisfy $x_{ip} = 1$. 

A robust regression method tries to estimate the regression parameter 
vector $\theta$ in such a way that it fits the bulk of the data even when there are 
outliers. 

The new version of PROGRESS provides two such robust regression 
methods: the least quantile of squares (LQS) and the least trimmed squares 
(LTS) estimator. Whereas classical least squares (LS) minimizes the sum 
of the squared residuals, LQS and LTS minimize a certain quantile, resp. 
a trimmed sum, of the squared residuals. Their exact definition is given 
below. (For any numbers $u_1,\dots,u_n$ the notation $u_{i:n}$ stands for the $i$-th 
order statistic.) 


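Definition 1 Let $Z = (X,y)$ be a data set of $n$ observations in $\mathbb{R}^p$. 
Then for all $p \le h \le n$, the least quantile of squares (LQS) estimate 
$\hat{\theta}_{LQS}(Z)$ and the least trimmed squares (LTS) estimate $\hat{\theta}_{LTS}(Z)$ are defined 
by 

$$\hat{\theta}_{LQS}(Z) = \mathop{\mathrm{argmin}}_{\theta}\, (r^2(\theta))_{h:n} = \mathop{\mathrm{argmin}}_{\theta}\, |r(\theta)|_{h:n} \qquad (2)$$

and 

$$\hat{\theta}_{LTS}(Z) = \mathop{\mathrm{argmin}}_{\theta} \sum_{i=1}^{h} (r^2(\theta))_{i:n}. \qquad (3)$$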


It is easy to see that LQS generalizes the LMS method which minimizes 
the median of the squared residuals. Indeed, for n odd and h = [n/2] +1 
the LQS becomes the LMS. 
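
As an illustration of Definition 1, the two objective functions can be evaluated for a candidate parameter vector as in the following sketch. The function names are hypothetical, and the data are assumed to be plain Python lists with the intercept column (all ones) stored last, as in the text.

```python
def squared_residuals(X, y, theta):
    """Squared residuals r_i(theta)^2 = (y_i - x_i' theta)^2."""
    return [
        (yi - sum(xij * tj for xij, tj in zip(xi, theta))) ** 2
        for xi, yi in zip(X, y)
    ]


def lqs_objective(X, y, theta, h):
    """h-th smallest squared residual, (r^2(theta))_{h:n}."""
    return sorted(squared_residuals(X, y, theta))[h - 1]


def lts_objective(X, y, theta, h):
    """Sum of the h smallest squared residuals."""
    return sum(sorted(squared_residuals(X, y, theta))[:h])
```

With five points of which four lie exactly on a line and one is a gross outlier, both objectives are zero at the true line for h = 3, whereas the ordinary sum of squares would be dominated by the outlier.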

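With the parameter estimates $\hat{\theta}_{LQS}(Z)$ and $\hat{\theta}_{LTS}(Z)$ we can associate 
estimators of the error scale $\sigma$: 

$$s_{LQS}(Z) = s_{LQS}(X,y) = c_{h,n}\, |r(\hat{\theta}_{LQS}(Z))|_{h:n} \qquad (4)$$

and 

$$s_{LTS}(Z) = s_{LTS}(X,y) = d_{h,n} \sqrt{\frac{1}{h}\sum_{i=1}^{h} (r^2(\hat{\theta}_{LTS}(Z)))_{i:n}}. \qquad (5)$$

The constants $c_{h,n}$ and $d_{h,n}$ are chosen to make the scale estimators 
consistent at the gaussian model, which gives 

$$c_{h,n} = \frac{1}{\Phi^{-1}\left(\frac{h+n}{2n}\right)}, \qquad d_{h,n} = \frac{1}{\sqrt{1 - \frac{2n}{h\,c_{h,n}}\,\phi(1/c_{h,n})}}.$$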

Moreover, $s_{LQS}$ is multiplied by a finite-sample correction factor. For $h = 
[n/2] + 1$ this factor equals $1 + \frac{5}{n-p}$. 
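More efficient scale estimates, based on the preliminary ones, are then 
given by 

$$s^* = \sqrt{\frac{\sum_{i=1}^{n} w_i r_i^2}{\sum_{i=1}^{n} w_i - p}} \qquad (6)$$

where 

$$w_i = \begin{cases} 0 & \text{if } |r_i/s_{LQ(T)S}| > 2.5 \\ 1 & \text{otherwise.} \end{cases}$$

Here the notation $s_{LQ(T)S}$ stands for $s_{LQS}$ or $s_{LTS}$, whichever is used. 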
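
A small numerical sketch of the scale computations follows. It rests on stated assumptions: the consistency constants are taken as $c_{h,n} = 1/\Phi^{-1}((h+n)/(2n))$ and $d_{h,n} = 1/\sqrt{1 - (2n/(h c_{h,n}))\phi(1/c_{h,n})}$, and the final scale as $\sqrt{\sum w_i r_i^2/(\sum w_i - p)}$ with hard 0/1 weights at the cutoff 2.5. Function names are illustrative, and the code needs $h < n$ so that the normal quantile is finite.

```python
from math import sqrt
from statistics import NormalDist


def consistency_constants(h, n):
    """Gaussian consistency factors (c_hn, d_hn) for the LQS and LTS scales."""
    nd = NormalDist()
    c = 1.0 / nd.inv_cdf((h + n) / (2.0 * n))  # requires h < n
    d = 1.0 / sqrt(1.0 - (2.0 * n / (h * c)) * nd.pdf(1.0 / c))
    return c, d


def reweighted_scale(resid, s0, p, cutoff=2.5):
    """Final scale from residuals and a preliminary scale s0.

    Observations with |r_i / s0| > cutoff get weight 0, all others weight 1.
    Assumes more than p observations keep weight 1.
    """
    w = [0 if abs(r / s0) > cutoff else 1 for r in resid]
    scale = sqrt(sum(wi * r * r for wi, r in zip(w, resid)) / (sum(w) - p))
    return scale, w
```

For h/n = 1/2 the first constant is about 1.4826, the familiar factor that makes the median absolute deviation consistent for the gaussian standard deviation.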


3 Breakdown value and choice of h 


In the next theorem we derive the breakdown value of LQS and LTS, which 
says how many of the n observations need to be replaced before the estimate 
is carried away. The finite-sample breakdown value (Donoho and Huber, 
1983) of any regression estimator T(Z) = T(X,y) is given by 


$$\varepsilon_n^* = \varepsilon_n^*(T, Z) = \min\left\{\frac{m}{n};\ \sup_{Z'} \|T(Z')\| = \infty\right\}$$


where $Z' = (X', y')$ ranges over all data sets obtained by replacing any $m$ 
observations of $Z = (X,y)$ by arbitrary points. We will assume that the 
original $X$ is in general position. This means that no $p$ of the $x_i$ lie on 
a $(p-1)$-dimensional plane through the origin. For simple regression with 
intercept ($p = 2$) this says that no two $x_i$ coincide. For simple regression 
without intercept ($p = 1$) it says that none of the $x_i$ are zero. 


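Theorem 1 If the $x_i$ are in general position, then the finite-sample 
breakdown value of the LQS and the LTS is 

$$\varepsilon_n^* = \begin{cases} (h - p + 1)/n & \text{if } p \le h < [(n+p+1)/2] \\ (n - h + 1)/n & \text{if } [(n+p+1)/2] \le h \le n. \end{cases}$$

The proof is given in the Appendix. 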


Corollary 1 If the $x_i$ are in general position, then the maximal finite-sample 
breakdown value of the LQS and the LTS equals 

$$\max_h \varepsilon_n^* = \frac{[(n-p)/2] + 1}{n}$$

and is achieved for 

$$[(n+p)/2] \le h \le [(n+p+1)/2].$$

When $n+p$ is even we have $[(n+p)/2] = [(n+p+1)/2]$, hence the optimal 
$h$ is unique. When $n+p$ is odd, it turns out that choosing $h = [(n+p+1)/2]$ 
gives the better finite-sample efficiency. Therefore, we will always define 
the optimal h as 


$$h_{opt} = [(n+p+1)/2].$$


This is also the default value of h in PROGRESS. If the user prefers to 
use another quantile, the program displays a range of h-values for which a 
breakdown value of at least 25% is attained. The lowest h-value allowed in 
the program is 

$$h_{\min} = [n/2] + 1.$$


(This is because for each $h < h_{\min}$ there exists some $h > h_{\min}$ with the 
same breakdown value and a higher finite-sample efficiency.) 
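
The choice of h can be explored numerically with the sketch below. It assumes the piecewise form of the breakdown value, $(h-p+1)/n$ for $p \le h < [(n+p+1)/2]$ and $(n-h+1)/n$ for larger $h$, together with the default and lowest allowed values of h given above. The function names are hypothetical and do not reproduce the program's own routines.

```python
def optimal_h(n, p):
    """Default quantile index h_opt = [(n + p + 1) / 2]."""
    return (n + p + 1) // 2


def minimal_h(n):
    """Lowest allowed quantile index h_min = [n / 2] + 1."""
    return n // 2 + 1


def breakdown_value(h, n, p):
    """Finite-sample breakdown value of LQS/LTS for p <= h <= n."""
    if not p <= h <= n:
        raise ValueError("need p <= h <= n")
    if h < optimal_h(n, p):
        return (h - p + 1) / n
    return (n - h + 1) / n


def admissible_h(n, p, min_breakdown=0.25):
    """All allowed h whose breakdown value is at least min_breakdown."""
    lo = max(p, minimal_h(n))
    return [h for h in range(lo, n + 1) if breakdown_value(h, n, p) >= min_breakdown]
```

For example, with n = 20 and p = 2 the default is h = 11 with breakdown value 0.5, and the values h = 11, ..., 16 keep the breakdown value at or above 25%.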


Remark 1 If $p = 1$ and $x_{ip} = 1$ for all observations, the regression model 
reduces to the univariate model $y_i = \mu + \sigma e_i$. In that case Theorem 1 is still 
valid, whereas the LQS and LTS become much easier to compute. In the 
univariate setting, a fast algorithm is available to compute the exact LQS 
and LTS estimates of the location parameter $\mu$ and the scale parameter $\sigma$. 


Table 1: Overview of the program PROGRESS. 


4 Outline of the PROGRESS algorithm 


PROGRESS not only computes the LQS and LTS. First, the LS estimates 
and inferences about the regression parameters are obtained. And after 
the LQS or LTS is found, a reweighted least squares (RLS) is carried out 
with weights based on LQS or LTS. Table 1 gives a schematic overview of 
the complete program. Since the essential algorithmic changes have been 
made in step 5, we will focus on that part here. We refer to (Rousseeuw 
and Leroy, 1987) for all details about the treatment of missing data, the 
standardization procedure, and the LS and RLS estimates. 


step | ACTION                                           | RESULT
1    | draw a (random) subset of p observations         |
2    | compute hyperplane through these p observations  | $\hat{\theta} = (\hat{\theta}_1,\dots,\hat{\theta}_{p-1},\hat{\theta}_p)$
3    | if regression with intercept => adjust intercept | $\tilde{\theta} = (\hat{\theta}_1,\dots,\hat{\theta}_{p-1},\tilde{\theta}_p)$
4    | evaluate the objective function at this estimate | $|r(\tilde{\theta})|_{h:n}$ or $\sum_{i=1}^{h}(r^2(\tilde{\theta}))_{i:n}$
5    | repeat steps 1 until 4, and keep the estimate    | $\hat{\theta}_{LQS}$ or $\hat{\theta}_{LTS}$
     | with lowest objective function value             |


Table 2: Summary of the algorithm for LQS and LTS. 


In general the objective functions of LQS and LTS are difficult to 
minimize exactly since they have several local minima. For this reason 
PROGRESS uses an approximate resampling algorithm (which does yield 
the exact solution in simple regression). Table 2 summarizes the main steps 
of this algorithm. 


n            | mechanism   | number of p-subsets used
small        | all subsets | $C_n^p$
intermediate | all subsets | $C_n^p$
             | random      | default (Table 4) or user-defined
large        | random      | default or user-defined


Table 3: Subsampling mechanism in PROGRESS. 


We will describe the first three steps more extensively. 


1. draw a (random) subset of p observations 


The drawing mechanism now implemented in PROGRESS is displayed 
in Table 3. According to the sample size n and the number of variables p, 
PROGRESS checks whether or not it is feasible to draw all subsets of p 
observations out of n. 

For small values of n (see Table 4) the program automatically generates 
all possible subsets of $p$ observations, of which there are $C_n^p = \binom{n}{p}$. For each 
of these p-subsets, steps 1 to 4 of Table 2 are carried out. 

If n is large for the p involved, the binomial coefficient would exceed 
1,000,000 and then PROGRESS switches to a random selection of p-subsets. 
It is possible for the user to preset the number of p-subsets to be considered. 
The more p-subsets you take, the lower the objective function will be, but 
at the cost of more computation time. On the other hand, one must select 
enough p-subsets for the probability of drawing at least one uncontami- 
nated p-subset to be close to 1 (otherwise, the fit could be based on bad 
observations only). In (Rousseeuw and Leroy, 1987, page 198) this minimal 
number of $p$-subsets is expressed as a function of the number of variables and 
the allowed percentage of contamination. The default number of subsets 
drawn in PROGRESS can be found in Table 4. For p < 9 these numbers 
exceed the required minimum, whereas for larger p the default is fixed at 
3000 subsets so as to avoid extremely long calculations. But as already 
mentioned, the user can always modify the proposed number of p-subsets. 

Finally, for all intermediate values of n the user can choose between 
considering all p-subsets or drawing a certain number of random p-subsets. 
As always, the program applies default choices unless the user explicitly 
asks to override them. 
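
The two counting arguments in this step can be illustrated as follows. The first function checks whether the number of all p-subsets exceeds the 1,000,000 limit mentioned above. The second uses the usual independence argument, in which the probability that at least one of m random p-subsets is free of contamination is $1 - (1 - (1-\varepsilon)^p)^m$; this formula is stated here as an assumption, since the text only cites the reference for it. The function names and the 0.99 target are illustrative.

```python
from math import ceil, comb, log


def exceeds_limit(n, p, limit=1_000_000):
    """True if drawing all p-subsets of n observations is considered too costly."""
    return comb(n, p) > limit


def min_subsets(p, eps, prob=0.99):
    """Smallest m with 1 - (1 - (1 - eps)**p)**m >= prob.

    eps is the assumed fraction of contaminated observations (0 <= eps < 1).
    """
    clean = (1.0 - eps) ** p  # chance that one p-subset is uncontaminated
    if clean >= 1.0:
        return 1
    return ceil(log(1.0 - prob) / log(1.0 - clean))
```

For p = 2 the all-subsets limit is first exceeded at n = 1415, and for p = 3 at n = 183, in line with the sample sizes beyond which n is treated as large.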


p                                |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9 |   10
n is ‘small’ if n <              |  500 |   50 |   22 |   17 |   15 |   14 |    0 |    0 |    0 |    0
n is ‘large’ if n >              | 10^6 | 1414 |  182 |   71 |   43 |   32 |   27 |   24 |   23 |   22
default number of p-subsets used |  500 | 1000 | 1500 | 2000 | 2500 | 3000 | 3000 | 3000 | 3000 | 3000


Table 4: Sample sizes n which are considered to be small or large (for 
a given p). Also the default number of p-subsets used in PROGRESS is 
listed. 


2. compute hyperplane through these p observations 


If the x; are in general position then every p-subset determines a unique 
hyperplane, that is found by solving the linear system formed by these p 
observations. 

In practice also a singular p-subset can occur, and then PROGRESS 
draws a new p-subset. The output then reports the total number of singu- 
lar p-subsets that were encountered. 
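
Step 2 amounts to solving a p x p linear system and detecting singular subsets. The sketch below uses Gaussian elimination with partial pivoting and returns `None` for a (numerically) singular subset, so that a caller could draw a new one. The function name and the tolerance are assumptions of this illustration, not details of the program.

```python
def fit_through_subset(X_sub, y_sub, tol=1e-10):
    """Solve X_sub * theta = y_sub for a p-subset; None if the subset is singular."""
    p = len(y_sub)
    # Augmented matrix [X_sub | y_sub] in floating point.
    A = [list(map(float, row)) + [float(b)] for row, b in zip(X_sub, y_sub)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < tol:
            return None  # singular p-subset
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution.
    theta = [0.0] * p
    for i in range(p - 1, -1, -1):
        rest = sum(A[i][j] * theta[j] for j in range(i + 1, p))
        theta[i] = (A[i][p] - rest) / A[i][i]
    return theta
```

With the intercept column stored last, the two points (1, 3) and (2, 5) give slope 2 and intercept 1, while two identical design rows are reported as singular.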


3. if regression with intercept => adjust intercept 


Here, ‘intercept adjustment’ stands for a technique which decreases the 
objective value of a given fit. We will apply it to each $p$-subset. After the 
hyperplane through the $p$ observations is determined, we have an initial 
estimate of the slope and the intercept, given by $\hat{\theta} = (\hat{\theta}_1,\dots,\hat{\theta}_{p-1},\hat{\theta}_p)$ 
where $\hat{\theta}_p$ is the intercept. The corresponding objective value for LQS then 
equals 


$$|r(\hat{\theta})|_{h:n} = |y_i - x_{i1}\hat{\theta}_1 - \dots - x_{i,p-1}\hat{\theta}_{p-1} - \hat{\theta}_p|_{h:n}. \qquad (7)$$


For LTS we can rewrite (3) accordingly. The adjusted intercept $\tilde{\theta}_p$ is then 
defined as the LQS (resp. LTS) location estimate applied to the univariate 
data set $\{t_i = y_i - x_{i1}\hat{\theta}_1 - \dots - x_{i,p-1}\hat{\theta}_{p-1};\ i = 1,\dots,n\}$, i.e. 

$$\tilde{\theta}_p = \mathop{\mathrm{argmin}}_{\mu}\, |t_i - \mu|_{h:n} \qquad (8)$$

for LQS. By construction, (8) yields a lower objective value than (7). In 
simple regression (p = 2), it follows from (Steele and Steiger, 1986) that if 
all 2-subsets are used and their intercept is adjusted each time, we obtain 
the exact LQS. 

As indicated in Remark 1, the LQS and LTS location estimates can be 
found by an explicit algorithm. For LQS it is the midpoint of the shortest 
interval that contains h observations, as was proved in (Rousseeuw, 1984, 
page 873). We thus have to order the univariate observations $\{t_1,\dots,t_n\}$ 
to $t_{1:n} \le \dots \le t_{n:n}$ and then compute the length of the contiguous intervals 
that contain h points. When the smallest length is attained by several 
intervals, we take the median of the corresponding midpoints. 
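
The shortest-interval rule, and its use for intercept adjustment in simple regression, can be sketched as follows. This is an illustration under stated assumptions: data are plain lists, all 2-subsets are enumerated, pairs with equal x-values are skipped as singular, and the function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def lqs_location(t, h):
    """Univariate LQS: midpoint and half-length of the shortest interval
    containing h of the ordered observations (ties: median of midpoints)."""
    ts = sorted(t)
    lengths = [ts[i + h - 1] - ts[i] for i in range(len(ts) - h + 1)]
    shortest = min(lengths)
    mids = [
        (ts[i] + ts[i + h - 1]) / 2.0
        for i, length in enumerate(lengths)
        if length == shortest
    ]
    return median(mids), shortest / 2.0


def lqs_simple_regression(x, y, h):
    """Scan all 2-subsets, adjust the intercept each time, keep the best fit.

    Returns (slope, intercept, objective), where the objective is the h-th
    smallest absolute residual of the adjusted fit.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # singular 2-subset
        slope = (y[j] - y[i]) / (x[j] - x[i])
        t = [yk - slope * xk for xk, yk in zip(x, y)]
        intercept, half_length = lqs_location(t, h)
        if best is None or half_length < best[2]:
            best = (slope, intercept, half_length)
    return best
```

After adjustment the LQS objective of a fit equals half the length of the shortest interval, which is why the second function can compare candidates using the value returned by the first.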

The univariate LTS estimator corresponds to the mean of the subset that 
contains h observations and that has the smallest sum of squares. This sum 
of squares is defined as the sum of the squared deviations from the subset 


mean: given an $h$-subset $t_{i:n},\dots,t_{i+h-1:n}$ with mean $\bar{t}^{(i)}$ we have 

$$SQ^{(i)} = \sum_{j=i}^{i+h-1} (t_{j:n} - \bar{t}^{(i)})^2.$$


Note that the selected h-subset has to consist of successive observations, 
which is why we had to order the $t_1,\dots,t_n$ first. 
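
A sketch of the univariate LTS location estimate follows. It slides a window of h successive order statistics and updates running sums, so that each window costs constant time after sorting. The update formula $SQ = \sum t^2 - (\sum t)^2/h$ is a standard identity rather than something taken from the text, and it can lose accuracy for data with a very large mean; the function name is hypothetical.

```python
def lts_location(t, h):
    """Univariate LTS: mean of the contiguous h-subset of the ordered data
    with the smallest sum of squared deviations. Returns (mean, sum_of_squares)."""
    ts = sorted(t)
    n = len(ts)
    s1 = float(sum(ts[:h]))                 # running sum of the window
    s2 = float(sum(v * v for v in ts[:h]))  # running sum of squares
    best_mean, best_sq = s1 / h, s2 - s1 * s1 / h
    for i in range(1, n - h + 1):
        out, new = ts[i - 1], ts[i + h - 1]
        s1 += new - out
        s2 += new * new - out * out
        sq = s2 - s1 * s1 / h
        if sq < best_sq:
            best_mean, best_sq = s1 / h, sq
    return best_mean, best_sq
```

The preliminary LTS scale described in the text would then be the square root of the returned sum of squares divided by h, before any consistency factor is applied.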

For a recent study of the effect of intercept adjustment on the perfor- 
mance of LTS regression, see Croux et al. (1996). 

In order to adjust the intercepts the univariate LQS and LTS methods 
were included into PROGRESS, which also allows the user to analyze data 
sets that were univariate from the start. As in the regression situation, the 
preliminary scale is then defined by (4) resp. (5), both of which come out 
of the univariate algorithms. For LQS it is half the length of the shortest 
interval, whereas for LTS it is the square root of the smallest sum of squares 
divided by h. We then obtain the final scale estimate as in (6). 


5 Coefficient of determination ($R^2$) 


Let us first consider the regression model with intercept. Along with the 
classical least squares (LS) comes the coefficient of determination, which 
measures the proportion of the variance of the response variable explained 
by the linear model, i.e. 


$$0 \le R^2 = \frac{\mathrm{Var}(y_i) - \mathrm{Var}(r_i)}{\mathrm{Var}(y_i)} = 1 - \frac{\mathrm{Var}(r_i)}{\mathrm{Var}(y_i)} \le 1. \qquad (9)$$

The denominator in this expression measures the variability of the response 
in a model without explanatory variables, which in this case is the univariate 
model $y_i = \mu + \sigma e_i$. If we denote the LS coefficient estimate of a sample 
$(X,y)$ by $\hat{\theta}_{LS}(X,y)$, and use the scale estimate given by 


$$s_{LS}^2(X,y) = \frac{1}{n}\sum_{i=1}^{n} (y_i - x_i^t\hat{\theta}_{LS}(X,y))^2, \qquad (10)$$

we can rewrite (9) as 

$$R^2_{LS} = 1 - \frac{s^2_{LS}(X,y)}{s^2_{LS}(1,y)}. \qquad (11)$$

By analogy, we propose the robust counterpart given by 


$$R^2_{LQ(T)S} = 1 - \frac{s^2_{LQ(T)S}(X,y)}{s^2_{LQ(T)S}(1,y)}. \qquad (12)$$

Note that when using definition (12), the robust coefficient of determi- 
nation always falls in the interval [0,1]. This was not guaranteed by the 
earlier version of $R^2$ defined in (Rousseeuw and Leroy, 1987, page 44) and 
implemented in the first version of PROGRESS. There the denominator 
$(\mathrm{mad}\, y)^2 = (\mathrm{med}_i |y_i - \mathrm{med}_j y_j|)^2$ was used, whereas now $s_{LQS}(1,y) = 
|y - \hat{\theta}_{LQS}(1,y)|_{h:n}$, which is just the scale estimate of the univariate LQS 
applied to the response. 
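
For the LQS case with intercept, the robust coefficient of determination can be computed as in the following sketch. It assumes that the consistency factor is the same in numerator and denominator and therefore cancels, that the residuals of the fitted model are supplied by the caller, and that the denominator is half the length of the shortest interval containing h of the responses. The function name is hypothetical.

```python
def robust_r2_lqs(resid, y, h):
    """1 - (|r|_{h:n} / s(1, y))**2, with s(1, y) the univariate LQS scale of y."""
    numerator = sorted(abs(r) for r in resid)[h - 1]
    ys = sorted(y)
    # Univariate LQS scale: half the shortest interval covering h responses.
    denominator = min(ys[i + h - 1] - ys[i] for i in range(len(ys) - h + 1)) / 2.0
    if denominator == 0:
        raise ValueError("at least h responses coincide; ratio undefined")
    return 1.0 - (numerator / denominator) ** 2
```

A fit that passes exactly through h of the observations gives the value 1, and a fit that does no better than the univariate location estimate of the responses gives 0.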

An analogous reasoning works for the regression model without inter- 
cept. In that case, the model without explanatory variables reduces to 
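$y_i = \sigma e_i$ without any location parameter. For LS,

$$R^2_{LS} = 1 - \frac{s^2_{LS}(X, y)}{s^2_{LS}(0, y)} = 1 - \frac{\sum_i (y_i - x_i \hat\theta_{LS})^2}{\sum_i y_i^2}, \tag{13}$$

and we propose the following robust counterparts:

$$R^2_{LQS} = 1 - \frac{s^2_{LQS}(X, y)}{s^2_{LQS}(0, y)} = 1 - \frac{\bigl((y - x \hat\theta_{LQS})^2\bigr)_{h:n}}{(y^2)_{h:n}}$$

and

$$R^2_{LTS} = 1 - \frac{s^2_{LTS}(X, y)}{s^2_{LTS}(0, y)} = 1 - \frac{\sum_{i=1}^{h} \bigl((y - x \hat\theta_{LTS})^2\bigr)_{i:n}}{\sum_{i=1}^{h} (y^2)_{i:n}}.$$


6 Diagnostics
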
Observations in regression data essentially belong to four types: 


regular observations with internal $x_i$ and well-fitting $y_i$,
vertical outliers with internal $x_i$ and non-fitting $y_i$,
good leverage points with outlying $x_i$ and well-fitting $y_i$,
bad leverage points with outlying $x_i$ and non-fitting $y_i$.


Figure 1 shows these four types in simple regression. Regression diag- 
nostics aim to detect observations of one or more of these types. Here we 
will consider three robust diagnostics: standardized residuals, the resistant 
diagnostic, and the diagnostic plot. 


[Figure 1 point labels: vertical outlier, good leverage point, bad leverage point.]

Figure 1: Simple regression data with points of all four types.


1. Standardized residuals are defined as $r_i(\hat\theta)/s(\hat\theta)$, where
$s(\hat\theta)$ denotes a robust scale estimate based on the residuals. Here we
will use $s_{LQS}(\hat\theta) = c_{h,n}\, |r(\hat\theta)|_{h:n}$ or
$s_{LTS}(\hat\theta) = d_{h,n} \sqrt{\frac{1}{h} \sum_{i=1}^{h} (r^2(\hat\theta))_{i:n}}$.
Standardized residuals help us to distinguish between well-fitting and
non-fitting observations by comparing their absolute values to some
yardstick, e.g.


compare $|r_i(\hat\theta)|/s(\hat\theta)$ to 2.5.


We use the yardstick 2.5 since it would determine a (roughly) 99% tolerance
interval for the $e_i$ if they had a standard Gaussian distribution. Since the
standardized residuals approximate the $e_i$, we will consider an observation
as non-fitting if its standardized residual lies (far) outside this tolerance
region.


[Figure 2 labels: vertical axis "standardized LQS or LTS residual" with cutoff 2.5; horizontal axis "resistant diagnostic"; regions "regular observations" and "good leverage points" in the central band, "vertical outliers + bad leverage points" in the outer bands.]

Figure 2: Classification of observations by plotting the standardized resid-
uals versus their resistant diagnostics.


2. The resistant diagnostic. Non-regular observations have the property
that they are 'far away' from some hyperplane in $\mathbb{R}^{p+1}$ (that is, further
away than the majority of the observations). The vertical outliers and the
bad leverage points are clearly far away from the ideal regression plane given
by $y = x^t\theta$. But also a good leverage point lies far away, relative to some
other hyperplane that goes through the center of the regular observations.
To define the 'distance' of an observation $(x_i, y_i)$ to a plane $y = x^t\theta$ we can
use its absolute standardized residual. If we now define

$$U_i = \sup_{\theta} \frac{|r_i(\theta)|}{s_{LQS}(\theta)} \quad\text{or}\quad U_i = \sup_{\theta} \frac{|r_i(\theta)|}{s_{LTS}(\theta)}$$

we expect outliers to have a large $U_i$. Since the $U_i$ are difficult to compute
exactly, we approximate them by taking the maximum over all $\hat\theta$ that are
computed inside the LQS/LTS algorithm. For each observation, this yields
the value

$$u_i = \max_{\hat\theta} \frac{|r_i(\hat\theta)|}{s_{LQS}(\hat\theta)} \quad\text{or}\quad u_i = \max_{\hat\theta} \frac{|r_i(\hat\theta)|}{s_{LTS}(\hat\theta)}. \tag{14}$$

Therefore we only need to store one array $(u_1, \ldots, u_n)$ that has to be
updated at each p-subset. Finally, we define the resistant diagnostic for each
observation by standardizing the $u_i$, yielding

$$\text{resistant diagnostic}_i = \frac{u_i}{\mathrm{med}_{j=1,\ldots,n}\, u_j}. \tag{15}$$

In the new version of PROGRESS the resistant diagnostic is available for
both LQS and LTS, and it is based on the trial estimates $\hat\theta$ after location
adjustment. From (14) it is clear that non-regular observations will have
a large $u_i$ and consequently a large resistant diagnostic. Simulations have
indicated that we may consider (15) as 'large' if it exceeds 2.5. Combining
the standardized residuals with the resistant diagnostic leads to the dia-
gram in Figure 2. However, a disadvantage of Figure 2 is that it cannot
distinguish between vertical outliers and bad leverage points.
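
The update of the array $(u_1, \ldots, u_n)$ over p-subsets can be sketched as follows. This is a simplified illustration under stated assumptions, not the PROGRESS implementation: trial fits come from random p-subsets solved exactly, the raw LQS scale $|r|_{h:n}$ is used without its consistency factor, and the location adjustment of the intercept is skipped. The function name and its parameters are hypothetical.

```python
import numpy as np

def resistant_diagnostic(X, y, h, n_trials=500, seed=0):
    """Approximate (15): for each observation, the largest absolute
    residual, standardized by the raw LQS scale, over all trial fits,
    divided by the median of these maxima."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    u = np.zeros(n)
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            theta = np.linalg.solve(X[idx], y[idx])  # exact fit to p points
        except np.linalg.LinAlgError:
            continue  # singular p-subset, draw another one
        abs_res = np.abs(y - X @ theta)
        s = np.sort(abs_res)[h - 1]  # raw LQS scale |r|_{h:n}
        if s <= 0.0:
            continue  # degenerate trial, cannot standardize
        u = np.maximum(u, abs_res / s)  # running maximum, as in (14)
    return u / np.median(u)
```

Only the running maximum `u` is stored between trials, which mirrors the remark that a single array updated at each p-subset is enough.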


3. The diagnostic plot makes the complete classification into the four
types. Since leverage points are outlying in the space of the regressors $x_i$,
one can distinguish them from vertical outliers by analyzing their
x-components. For this we can run MINVOL on $X = \{x_i;\ 1 \le i \le n\}$. This program
computes the Minimum Volume Ellipsoid (MVE) location estimate $T(X)$
and scatter matrix $C(X)$. The MVE is a highly robust estimator of lo-
cation and scatter, introduced by Rousseeuw (1985). The corresponding
robust distance of an observation to the center is then given by

$$RD(x_i) = \sqrt{(x_i - T(X))\, C(X)^{-1}\, (x_i - T(X))^t}.$$

Since the squares of these distances roughly have a chi-squared distribution
when there are no outliers among the $x_i$, we will classify an observation as
a leverage point if its $RD(x_i)$ exceeds the cutoff value given by the square
root of the 0.975 quantile of that chi-squared distribution. If we
combine this information with the standardized LQS or LTS residual, we
obtain the diagnostic plot of (Rousseeuw and van Zomeren, 1990) shown
in Figure 3.
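
The four-way classification behind the diagnostic plot reduces to two threshold comparisons. The sketch below assumes that robust location and scatter estimates are already available (for instance from an MVE routine) and that the caller supplies the cutoff for the robust distances; the function names are hypothetical.

```python
import numpy as np

def robust_distances(X, center, scatter):
    """Distance of each row of X to `center` in the metric of `scatter`."""
    diff = np.asarray(X, dtype=float) - np.asarray(center, dtype=float)
    inv = np.linalg.inv(np.asarray(scatter, dtype=float))
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv, diff))

def classify(std_resid, rd, rd_cutoff, resid_cutoff=2.5):
    """Label each observation as one of the four types."""
    labels = []
    for r, d in zip(std_resid, rd):
        non_fitting = abs(r) > resid_cutoff  # large standardized residual
        outlying_x = d > rd_cutoff           # large robust distance
        if outlying_x:
            labels.append("bad leverage point" if non_fitting
                          else "good leverage point")
        else:
            labels.append("vertical outlier" if non_fitting
                          else "regular observation")
    return labels
```

Unlike the plot against the resistant diagnostic, this rule separates vertical outliers from bad leverage points, because the horizontal coordinate depends on the x-components only.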


[Figure 3 labels: vertical axis "standardized LQS or LTS residual"; horizontal axis "robust distance RD(x_i)" with a vertical cutoff line; regions "regular observations" and "good leverage points" in the central band, "vertical outliers" and "bad leverage points" in the outer bands.]

Figure 3: Diagnostic plot, obtained by plotting the standardized robust
residuals versus the robust distances $RD(x_i)$.


7 Software availability


The programs PROGRESS and MINVOL can be obtained from our website
http://win-www.uia.ac.be/u/statis

Questions or remarks about the implementation can be directed to 
Mia.Hubert@uia.ua.ac.be. 

The LMS, LTS and MVE methods are also available in S-PLUS as the 
functions lmsreg, ltsreg and cov.mve. Moreover, the functions LMS, 
LTS and MVE based on the recent versions of PROGRESS and MINVOL 
have been incorporated into SAS/IML (Version 6.12) in 1996. Their docu- 
mentation can be obtained by writing to sasaxs@unx.sas.com. 


Appendix: Proof of Theorem 1 


We show that the proof of the breakdown value of the LMS (Rousseeuw, 
1984, page 878) remains valid after making the necessary modifications. 
The proofs for LQS and LTS are very similar, so we will mainly consider 
the LQS estimator. 
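
For orientation, the two cases treated in this proof can be combined into a small helper. It assumes that Theorem 1 gives the breakdown value as $(h - p + 1)/n$ when $h < [(n+p+1)/2]$ and as $(n - h + 1)/n$ otherwise, which is what the lower and upper bounds derived in this appendix add up to; the function name is hypothetical.

```python
def breakdown_value(n, p, h):
    """Breakdown value of LQS/LTS with coverage h, for n observations
    and p coefficients, assuming data in general position."""
    threshold = (n + p + 1) // 2  # integer part of (n + p + 1) / 2
    if h < threshold:
        return (h - p + 1) / n
    return (n - h + 1) / n
```

Under this assumption the value is largest at $h = [(n+p+1)/2]$ and decreases as h moves away from it in either direction.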

First suppose $h < [(n+p+1)/2]$. We obtain the lower bound on $\varepsilon_n^*$ by re-
placing $h - p + 1 - 1 = h - p$ observations of $Z$, yielding $Z'$. Define
$\theta = \hat\theta_{LQS}(Z)$ and $\theta' = \hat\theta_{LQS}(Z')$. The $n - h + p > h$ original points $(x_i, y_i)$
then satisfy $|r_i(\theta)| = |y_i - x_i\theta| \le M = \max_Z |r_i(\theta)|$ such that for the cor-
rupted data set $Z'$, $|r_i(\theta')|_{h:n} \le |r_i(\theta)|_{h:n} \le M$. For $p > 1$ we refer to the
geometrical construction of (Rousseeuw, 1984). In his notation, the set
$Z' \setminus A$ contains at most $n - (n - h + p - (p - 1)) = h - 1$ observations. If
we assume $\|\theta' - \theta\| > 2(\|\theta\| + M/\rho)$, this implies

$$|r_i(\theta')|_{h:n} > M, \tag{16}$$

a contradiction. Therefore, $\|\theta'\|$ remains bounded. For $p = 1$ we set
$C = 2M/N = 2M/\min_Z |x_i|$. Now suppose $|\theta - \theta'| > C$. For all non-
contaminated observations we have that
$|r_i(\theta) - r_i(\theta')| = |y_i - x_i\theta - y_i + x_i\theta'| = |x_i|\,|\theta - \theta'| > NC = 2M$,
from which we get
$|r_i(\theta')| \ge |r_i(\theta) - r_i(\theta')| - |r_i(\theta)| > 2M - M = M$. Again this implies (16) and thus a
bounded $\theta'$. The upper bound on $\varepsilon_n^*$ follows from the fact that we can put
$h - p + 1$ bad observations on a hyperplane that contains $p - 1$ original
points. Then $(h - p + 1) + (p - 1) = h$ observations satisfy $y_i = x_i\theta'$, and
thus $\theta' = \hat\theta_{LQS}(Z')$. Making the hyperplane steeper will break down the
estimator.

For $h \ge [(n+p+1)/2]$, we obtain the lower bound analogously to the previous
case. Just observe that we now have $n - (n - h + 1) + 1 = h$ original
observations, and that $Z' \setminus A$ has at most
$n - (h - (p - 1)) = n - h + p - 1 \le h - 1$ points. The remaining inequality
$\varepsilon_n^* \le (n - h + 1)/n$ can
be proved as follows. Take some $M > \|\theta\|$. Then we show that we can
always construct a corrupted sample $Z'$ with $n - h + 1$ bad observations,
such that $\|\theta'\| = \|\hat\theta_{LQS}(Z')\| > M$. Letting $M$ go to infinity will then
cause the LQS to break down. Define $M_x = \max_i \|x_i\|$. Now we set all the
$n - h + 1$ replaced observations equal to the point $(x, y) = (x, 2M_xM + K)$
for which $\|x\| = M_x$ and $K > 0$. These replaced observations satisfy
$|x\theta| \le \|x\|\,\|\theta\| \le \|x\| M = M_xM < y$ and thus
$|r_i(\theta)| = |y_i - x_i\theta| \ge |y| - |x\theta| \ge M_xM + K$. As $n - h + 1 > n - h$ this yields
$|r_i(\theta)|_{h:n} \ge M_xM + K$. Since we can choose $K$ arbitrarily large, the minimum of the
objective function of LQS will not be reached for $\|\theta\| \le M$. Consequently
$\|\theta'\| = \|\hat\theta_{LQS}(Z')\|$ has to be larger than $M$, which ends the proof. Finally
we note that using the same construction, the objective function of LTS
satisfies

$$\sum_{i=1}^{h} (r^2(\theta))_{i:n} \ge (M_xM + K)^2$$

yielding the same result.


Dimension reduction via parametric
inverse regression

Abstract: In this paper, a linear subspace containing part or all of the
information for the regression of an m-vector Y on a p-vector X and its
dimension are estimated by means of inverse regression. Smooth
parametric curves are fitted to the p inverse regressions through a mul- 
tivariate linear model, without imposing any strict assumptions on the 
error distribution. This method is expected to be more powerful in re- 
ducing the dimension of a regression problem when compared to SIR, 
the estimation procedure proposed by Li (1991), that is based on fitting 
piecewise constant functions to the inverse regression curves. 


Key words: Dimension reduction, regression, linear subspace estimation. 


AMS subject classification: 62A99, 62H05. 


1 Introduction 


Let $Y \in \mathbb{R}^m$ and $X \in \mathbb{R}^p$ have joint cumulative distribution function (c.d.f.)
$F(Y, X)$. In a regression setting the behavior of the conditional cumulative
distribution function of Y given X, $F(Y|X)$, as the value of X varies in
its marginal sample space is under study. As a means of characterizing the
regression structure, consider replacing X by $k \le p$ linear combinations of
its components, $\eta_1^T X, \ldots, \eta_k^T X$, without losing information on $F(Y|X)$ so
that, for all values of X,

$$F(Y|X) = F(Y \mid \eta_1^T X, \ldots, \eta_k^T X) = F(Y \mid \eta^T X) \tag{1}$$

where $\eta$ is the p x k matrix with columns $\eta_j$, and $F(\cdot|\cdot)$ denotes the con-
ditional c.d.f. of the first argument given the second. Equation (1) holds
trivially when $\eta = I_p$, where $I_p$ denotes the identity matrix of dimension
p, and thus it imposes no restrictions on $F(Y|X)$. It can be expressed
equivalently as

$$Y \perp\!\!\!\perp X \mid \eta^T X \tag{2}$$

where the notation $U \perp\!\!\!\perp V \mid W$ in (2) means that U is independent of V given
any value for W (Dawid, 1979). Both (1) and (2) express the fact that the
conditional c.d.f. of $Y|X$ depends on X only through $\eta^T X$, the coordinates
of a projection of X onto the k-dimensional linear subspace spanned by the
columns of $\eta$. Consequently, $\eta^T X$ can be used in place of X without loss
of information on the regression.

An example where (2) holds is the additive-error regression model 


$$Y|X = g(\eta_1^T X, \ldots, \eta_k^T X) + \varepsilon$$

where $\varepsilon \perp\!\!\!\perp X$ and $E(\varepsilon) = 0$.

For any vector or matrix $\alpha$, let $S(\alpha)$ denote its range space and $\dim(S(\alpha))$
denote its dimension. If (1) holds then it also holds with $\eta$ replaced by any
basis for $S(\eta)$. In this sense, (1) and (2) can be regarded as statements
about $S(\eta)$ rather than statements about $\eta$ per se. Thus, when (2) holds
we follow Li (1991, 1992) and call $S(\eta)$ a dimension-reduction subspace for
$F(Y|X)$ or for the regression of Y on X.

Obviously, the smallest dimension-reduction subspace provides the great- 
est dimension reduction in the predictor vector. Unfortunately, smallest 
or minimum dimension-reduction subspaces (Cook 1994a) are not always 
unique. To circumvent the latter, Cook (1994b, 1996) introduced the notion 
of central dimension-reduction subspaces: 


Definition 1 A subspace S is a central dimension-reduction subspace for
the regression of Y on X if (a) S is a dimension-reduction subspace and
(b) $S \subseteq S_{drs}$ for all dimension-reduction subspaces $S_{drs}$, i.e. $S = \cap S_{drs}$. A
central dimension-reduction subspace will be denoted by $S_{Y|X}(\cdot)$.


The intersection of all dimension-reduction subspaces $\cap S_{drs}$ is trivially
a subspace but it is not necessarily a dimension-reduction one. Also, it 
is easy to see that a central dimension-reduction subspace is a minimum 
dimension-reduction subspace but the converse is not always true. In fact, 
there are regression problems for which the central dimension-reduction 
subspace does not exist. A detailed discussion of these issues can be found 
in Cook (1994b, 1996). 

By definition, a central dimension-reduction subspace, being the inter-
section of all dimension-reduction subspaces, is unique when it exists. The
existence of central subspaces can be assured by placing fairly weak restric-
tions on aspects of the joint distribution of Y and X (Cook 1994a, 1996). In
this paper, we concentrate on regressions where central dimension-reduction
spaces exist.

The subspace $S_{Y|X}$ is considered a "super-parameter" that is used to
index the conditional distribution of Y given X and its estimation is the
main theme of this work. Throughout the rest of this article, the columns
of the p x k matrix $\eta$ form a basis for the central space $S_{Y|X}$, and k is used
to denote its dimension.


1.1 Inverse regression and SIR 


Methods are available for estimating portions of the central subspace $S_{Y|X}$
if we are willing to place certain conditions on the marginal distribution of 
the predictors. The method that will be presented in this article is based 
on inverse regression. 

Let $S_{E(X|Y)}$ denote the subspace spanned by $\{E(X|Y) - E(X) : Y \in \Omega_Y\}$,
where $\Omega_Y \subset \mathbb{R}^m$ is the marginal sample space of Y. The condition
that the marginal distribution of the predictors X must satisfy in order 
for inverse regression to be useful in estimating a portion of the central 
subspace is stated in the following theorem. The theorem, as presented 
by Li (1991), is based on an arbitrary dimension-reduction subspace which 
need not be central. However, the version here is stated in terms of the 
central subspace. Throughout this article, boldface capital Latin letters 
will denote matrices, even though other symbols will also be used for the 
same purpose provided there is no fear of confusion. 


Theorem 1 Assume that the central subspace $S_{Y|X}(\eta)$ exists for $F(Y|X)$,
and that, for all $b \in \mathbb{R}^p$, $E(b^T X \mid \eta^T X)$ is linear in $\eta^T X$. Then the centered
inverse regression curve $E(X|Y) - E(X)$ satisfies

$$E(X|Y) - E(X) \in S(\Sigma_x \eta).$$

Equivalently,

$$S_{E(X|Y)} \subseteq S(\Sigma_x \eta) = \Sigma_x S_{Y|X},$$

where $\Sigma_x = \mathrm{Cov}(X)$.


Proof: Li (1991) proved Theorem 1 for any dimension reduction subspace
$S(\eta)$ so that (2) is satisfied. It is obvious that if the theorem holds for an
arbitrary dimension reduction subspace, it also holds for the intersection of
all dimension reduction subspaces; that is, the central dimension reduction
subspace $S_{Y|X}$, provided it exists. $\Box$

The linearity condition on $E(b^T X \mid \eta^T X)$ is required to hold only for the
basis $\eta$ of the central subspace. Since $\eta$ is unknown, in practice we may
require that it hold for all possible $\eta$, which is equivalent to elliptical sym-
metry of the distribution of X (Eaton, 1986). Li (1991) mentioned that the
linearity condition is not a severe restriction, since most low-dimensional
projections of a high-dimensional data cloud are close to being normal (Dia-
conis and Freedman, 1984; and Hall and Li, 1993). In addition, there often
exist transformations of the predictors that make them comply with the
linearity condition. Cook and Nachtsheim (1994) suggested re-weighting of
the predictor vector to make it elliptically contoured.

In the next corollary, which follows directly from Theorem 1, the anal-
ogous result is given for a standardized random vector. Suppose that $\Sigma_x > 0$
and let Z be the standardized version of X,

$$Z = \Sigma_x^{-1/2} (X - E(X)).$$

Obviously, $E(Z) = 0$ and $\mathrm{Cov}(Z) = I_p$. Also, since Z is a 1-1 and onto
linear transformation of X, $Y \perp\!\!\!\perp X \mid \eta^T X$ if and only if $Y \perp\!\!\!\perp Z \mid \beta^T Z$, where
$\beta = \Sigma_x^{1/2} \eta$, or $\beta_i = \Sigma_x^{1/2} \eta_i$.
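
In practice the standardization is carried out with sample moments. A minimal sketch, assuming the sample covariance matrix is positive definite and using a symmetric inverse square root obtained from its eigendecomposition (the function names are hypothetical):

```python
import numpy as np

def inv_sqrt_sym(S):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals ** -0.5) @ vecs.T

def standardize(X):
    """Rows of the result are Sigma^{-1/2} (x_i - mean), using the
    sample mean and the sample covariance matrix of X."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # Sigma^{-1/2} is symmetric, so right-multiplying the centered rows
    # is the same as left-multiplying the centered column vectors.
    return (X - mean) @ inv_sqrt_sym(S)
```

A direction estimated on the standardized scale is mapped back to the original predictors by multiplying it with the same inverse square root.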


Corollary 1

$$E(Z|Y) \in S(\Sigma_x^{1/2} \eta) = S(\beta) = S_{Y|Z}$$


This corollary readily implies that $E(Z|Y) = P_\beta E(Z|Y)$, where $P_\beta$ is
the orthogonal projection operator for $S(\beta)$ with respect to the usual inner
product.

Corollary 1 also implies that $S_{E(Z|Y)}$ is a subspace of $S_{Y|Z}$. This does
not guarantee equality between $S_{E(Z|Y)}$ and $S_{Y|Z}$, and thus, inference about
$S_{E(Z|Y)}$ possibly covers only part of $S_{Y|Z}$. For example, if $Y = Z_1^2$, with
$Z_1$ being the first coordinate variable of Z, and if $Z_1$ is symmetric about
its mean, then $E(Z|Y) = 0$ even though $S_{Y|Z} = \mathrm{span}((1, 0)^T)$. For a
broader discussion of the inability of SIR, and consequently of the method
developed in this paper, to diagnose this symmetric dependence see Cook
and Weisberg (1991). The missed part of $S_{Y|Z}$ might be recovered from
higher order moments of the conditional distribution of Z given Y (Cook
and Weisberg 1991; Li 1992), but such issues are not addressed in this
article. We assume throughout that $S_{E(Z|Y)}$ is non-trivial, in the sense
that it contains non-zero directions, should they exist.
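
The symmetric-dependence example can be reproduced by simulation. The sketch below is an illustration only: it assumes Z is bivariate standard normal, sets $Y = Z_1^2$, and estimates $E(Z|Y)$ by averaging Z within slices of Y of roughly equal size; the helper name is hypothetical.

```python
import numpy as np

def slice_means(Z, y, n_slices=5):
    """Mean of the rows of Z within slices of y of roughly equal size."""
    order = np.argsort(y)
    return np.array([Z[idx].mean(axis=0)
                     for idx in np.array_split(order, n_slices)])

rng = np.random.default_rng(0)
Z = rng.standard_normal((20000, 2))
y = Z[:, 0] ** 2  # depends on Z only through its first coordinate

means = slice_means(Z, y)  # one row per slice, one column per coordinate
```

All entries of `means` stay close to zero, so an inverse-mean method sees no direction at all, even though Y is a deterministic function of the first coordinate.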

Theorem 1 and Corollary 1 lead to the use of inverse regression as an
estimation means of part or possibly the whole of the central dimension-
reduction subspace. One such method is SIR (Sliced Inverse Regression),
proposed by Li (1991). In SIR, the range of the one-dimensional vari-
able Y is partitioned into a fixed number of slices and the p components
of Z are regressed on $\tilde{Y}$, a discrete version of Y resulting from slicing its
range, giving p one-dimensional regression problems, instead of the possi-
bly high-dimensional forward regression of Y on Z. Then, a rather crude
nonparametric estimate of the inverse curve $E(X|Y)$ serves to estimate the
central dimension-reduction subspace. SIR includes an asymptotic test for
inferring about d, a lower bound on k.
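
The slicing step can be sketched as follows. This is a bare-bones illustration of the SIR idea rather than Li's full procedure: slices are formed with roughly equal counts, the dimension test is omitted, and the function name is hypothetical.

```python
import numpy as np

def sir_directions(X, y, n_slices=5):
    """Eigenvalues and back-transformed eigenvectors of the weighted
    covariance matrix of the slice means of the standardized predictors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Standardize X with the sample mean and covariance matrix.
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    S_inv_sqrt = (vecs * vals ** -0.5) @ vecs.T
    Z = (X - X.mean(axis=0)) @ S_inv_sqrt
    # Piecewise constant estimate of E(Z|Y): one mean per slice of y.
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    evals, evecs = np.linalg.eigh(M)
    evals, evecs = evals[::-1], evecs[:, ::-1]  # largest eigenvalue first
    return evals, S_inv_sqrt @ evecs            # directions on the X scale
```

Changing `n_slices` changes the matrix `M` and therefore the eigenvalues on which the dimension is judged, which is the source of the sensitivity to the number of slices.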

But, even though Li (1991) introduced an innovative way of reducing
the dimension in a regression problem, SIR has limitations, of which the
most important is that it can be ambiguous about the estimate of the
dimension, as the latter sometimes depends crucially on the choice of the
number of slices. This can be easily avoided by using standard regression
estimation techniques.

In this article, smooth parametric curves are fitted to the p inverse
regressions in order to estimate the central subspace $S_{Y|X}(\eta)$, without im-
posing any restrictions on the dimension of the response vector Y.


2 Parametric inverse regression 


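For simplicity assume that X is standardized to have 0 mean and the
identity covariance matrix. To model the conditional expectation of X
given Y, a multivariate linear model is fitted with X being the response,
$X^T = (x_1, \ldots, x_p)$, and Y, $Y^T = (y_1, \ldots, y_m)$, the explanatory vector. Let

$$E\bigl[(x_1, \ldots, x_p) \mid Y\bigr] = [\, f_1(Y) \;\cdots\; f_q(Y) \,] \begin{bmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{1p} \\ \vdots & \vdots & & \vdots \\ \beta_{q1} & \beta_{q2} & \cdots & \beta_{qp} \end{bmatrix}$$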


where the $f_l$'s are arbitrary, $\mathbb{R}$-valued, linearly independent known functions
of Y. Suppose that a random sample of size n is available on (Y, X). Then,
including a matrix of errors $E_n$, the model becomes

$$X_n | Y = Z_n B + E_n \tag{3}$$

where $X_n = (x_{ij})$, an n x p random matrix, $Z_n = (f_{il})$, an n x q fixed matrix
with $f_{il} = f_l(Y_i)$, and $B = (\beta_{lj})$, the q x p matrix of coefficients. The error
matrix $E_n$ satisfies

$$E(E_n | Y) = 0 \quad\text{and}\quad \mathrm{Cov}(E_n | Y) = \Sigma_{x|y} \otimes I_n$$

where $\Sigma_{x|y}$ is a p x p positive definite, unknown matrix that does not
depend on Y. $X_n$, $Z_n$, and $E_n$ are indexed by the sample size n to indicate
their dependence upon it. The symbol $\otimes$ denotes the Kronecker product.
Clearly, the rank of $Z_n$ is q. We assume that n > p in order to avoid trivial
cases. No distributional assumptions on the errors are made except that
the rows of the error matrix $E_n$ are independent with mean 0 and
constant covariance matrix $\Sigma_{x|y}$.

According to (3), $S_{E(X|Y)}$ is the linear subspace of $S_{Y|X}$ which is spanned
by the rows of $Z_n B$; that is, $S(B^T Z_n^T) = S_{E(X|Y)}$. Therefore, since
$\mathrm{rank}(B^T Z_n^T) = \mathrm{rank}(Z_n B)$, $\mathrm{rank}(Z_n B) \le \dim(S_{Y|X})$, and the rank of $Z_n B$
is a lower bound on the dimension of the central dimension-reduction sub-
space.

But,

$$\mathrm{rank}(Z_n B) = \mathrm{rank}(B^T Z_n^T Z_n B) = \mathrm{rank}(B)$$

since $Z_n^T Z_n$ is a positive definite matrix (see [A4.4], Seber, 1977). Thus,
the rank of $Z_n B$ is actually equal to the rank of B, and hence inference
on the dimension of $S_{E(X|Y)}$ can be based solely on B in the sense that an
estimate of the rank of B constitutes an estimate of a lower bound on the
dimension of $S_{Y|X}$.

The estimate of B to be used for inference on the rank of B is the
ordinary least squares estimate, given by

$$\hat{B}_n = (Z_n^T Z_n)^{-1} Z_n^T X_n \tag{4}$$
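
As an illustration of (4), the sketch below builds the design matrix for a univariate Y with the polynomial choice $f_l(Y) = Y^{l-1}$ and computes the least squares estimate. The polynomial basis, the restriction to m = 1, and the function names are assumptions made for the example.

```python
import numpy as np

def basis_matrix(y, degree=2):
    """n x q matrix with columns 1, y, ..., y**degree (q = degree + 1)."""
    y = np.asarray(y, dtype=float)
    return np.column_stack([y ** k for k in range(degree + 1)])

def fit_inverse_regression(X, y, degree=2):
    """Least squares fit of every column of X on the basis functions of y.
    Returns the design matrix Z_n and the q x p coefficient estimate."""
    Z = basis_matrix(y, degree)
    # lstsq solves the same normal equations as (4) without forming
    # the inverse of Z'Z explicitly.
    B_hat, *_ = np.linalg.lstsq(Z, np.asarray(X, dtype=float), rcond=None)
    return Z, B_hat
```

All p inverse regressions share the same design matrix, so a single call fits them together.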


3 The asymptotic distribution of $\hat{B}_n$


Let $e_i^{(n)}$ be the n-vector with 1 in the ith place and zeroes elsewhere. We
are interested in the asymptotic distribution of $\sqrt{n}\,(\hat{B}_n - B)$.
Let $H_n$ denote the covariance matrix of $\sqrt{n}\,(\hat{B}_n - B)$,

$$H_n = \mathrm{Cov}\bigl(\sqrt{n}\,(\hat{B}_n - B)\bigr) = \Sigma_{x|y} \otimes (Z_n^T Z_n / n)^{-1}.$$

The notation $\|\cdot\|_{\max}$ identifies the norm on the vector space of matrices
defined by

$$\|(a_{ij})\|_{\max} = \max_{i,j} |a_{ij}|$$

for a matrix $A = (a_{ij})$. The following lemma about the asymptotic distri-
bution of $\sqrt{n}\,(\hat{B}_n - B)$ follows readily from Theorem 2.4.3, Bunke and

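Bunke (1986), and the multivariate version of Slutsky's theorem (see [A
4.19], Bunke and Bunke, 1986).

Lemma 1 Let $\mathcal{M}^{>}_{pq}$ be the space of all pq x pq positive definite matrices and
let $\mathcal{F}$ be the space of distributions of the errors $E_n$. If

$$H_n \to H \in \mathcal{M}^{>}_{pq} \tag{5}$$

then

$$\sqrt{n}\,(\hat{B}_n - B) \xrightarrow{D} N_{qp}(0, H) \tag{6}$$

provided the following three conditions are satisfied:

$$\|(Z_n^T Z_n)^{-1} Z_n^T\|_{\max} = o(n^{-1/2}) \tag{I}$$

$$\sup_{F \in \mathcal{F}} \int_{\|x\| > c} \|x\|^2 \, dF(x) \to 0 \quad \text{as } c \to \infty \tag{II}$$

$$\inf_{\Sigma \in \mathcal{M}(\mathcal{F})} \lambda_{\min}(\Sigma) > r > 0 \tag{III}$$

where

$$\mathcal{M}(\mathcal{F}) = \Bigl\{ \int_{\mathbb{R}^p} x x^T \, dF(x) : F \in \mathcal{F} \Bigr\} \subset \mathcal{M}^{>}_{p}.$$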

The error distributions that are usually considered satisfy Conditions
(II) and (III).
Assume that there exists a matrix $G \in \mathcal{M}^{>}_{q}$, so that

$$(Z_n^T Z_n / n)^{-1} \to G \tag{7}$$

as $n \to \infty$. Also, assume that a consistent estimate $\hat\Sigma_{x|y}$ is available,
as $\Sigma_{x|y}$ is usually unknown. For instance, $\hat\Sigma_{x|y}$ can be taken to be the
matrix of residual sums of squares and cross-products from the regression
of X on Y divided by either n or $n - \mathrm{rank}(Z_n) = n - q$, the denominator
choice that makes $\hat\Sigma_{x|y}$ unbiased for $\Sigma_{x|y}$ (the proof is omitted). Let

$$\hat{H}_n = \hat\Sigma_{x|y} \otimes (Z_n^T Z_n / n)^{-1}. \tag{8}$$

Then, if (7) holds,

$$\hat{H}_n \to H. \tag{9}$$

(9) is a direct application of the triangle inequality and the fact that con-
tinuous functions of consistent estimates are themselves consistent. These
remarks result in the following corollary to Lemma 1.


Corollary 2 Suppose Conditions (I), (II), and (III) of Lemma 1 hold.
Also assume that $\hat\Sigma_{x|y}$ is a consistent estimate of $\Sigma_{x|y}$ and that (7) holds.
Then,

$$\sqrt{n}\, \hat{H}_n^{-1/2} (\hat{B}_n - B) \xrightarrow{D} N(0, I_{pq}) = N(0, I_p \otimes I_q) \tag{10}$$

Proof: Since $\hat\Sigma_{x|y}$ is consistent for $\Sigma_{x|y}$ and (7) holds, (10) is the result of
a direct application of the multivariate version of Slutsky's theorem (see [A
4.19] in Bunke and Bunke, 1986), and of Lemma 1. $\Box$

Let $d = \dim(S_{E(X|Y)})$. We have shown that $d = \mathrm{rank}(B)$ and thus,
we can use the least squares estimate of B to estimate the dimension of
$S_{E(X|Y)}$ as follows. Let $\lambda_j$, $j = 1, \ldots, \min(q, p)$, be the singular values of B.
Then, d is the number of the nonzero singular values of B, and inference
about d can be made by testing if

$$\Lambda^{(d)} = n \sum_{j=d+1}^{\min(q,p)} \lambda_j^2$$

is equal to zero. We have no direct access to $\Lambda^{(d)}$, but by observing that
the rank of a matrix is not affected when the matrix is multiplied by a
nonsingular matrix, the inference on d can be based on the following test
statistic

$$\hat\Lambda^{(d)} = n \sum_{j=d+1}^{\min(q,p)} \hat\lambda_j^2 \tag{11}$$

where $\hat\lambda_j$ are the singular values of

$$\Bigl(\frac{Z_n^T Z_n}{n}\Bigr)^{1/2} \hat{B}_n \hat\Sigma_{x|y}^{-1/2}. \tag{12}$$

(12) is used in place of $\hat{B}_n$ for convenience, as its asymptotic covariance
matrix is the identity. Now, the test is based on the asymptotic distribution
of $\hat\Lambda^{(d)}$.


4 The asymptotic distribution of $\hat\Lambda^{(d)}$


Given the asymptotic normality of the least squares estimate of B, we 
can obtain the asymptotic distribution of the singular values of a fixed 
nonsingular transformation of B based on a result about the asymptotic 
distribution of the singular values of a matrix by Eaton and Tyler (1994). 


Theorem 2 The asymptotic distribution of $\hat\Lambda^{(d)}$ defined in (11) and (12)
is $\chi^2_{(p-d)(q-d)}$.


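Proof: Consider the singular value decomposition of $G^{-1/2} B \Sigma_{x|y}^{-1/2}$, where
G is the positive definite limit matrix of $n (Z_n^T Z_n)^{-1}$,

$$G^{-1/2} B \Sigma_{x|y}^{-1/2} = \Gamma_1^T \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix} \Gamma_2^T.$$

D is a d x d diagonal matrix with the positive singular values of
$G^{-1/2} B \Sigma_{x|y}^{-1/2}$ along its diagonal. Partition
$\Gamma_1^T = (\Gamma_{11}\ \Gamma_{12}) : q \times q$, $\Gamma_{11} : q \times d$, $\Gamma_{12} : q \times (q - d)$,

$$\Gamma_2^T = \begin{bmatrix} \Gamma_{21}^T \\ \Gamma_{22}^T \end{bmatrix} : p \times p$$

where $\Gamma_{21}^T : d \times p$, $\Gamma_{22}^T : (p - d) \times p$. By the Eaton-Tyler result, the
limiting distribution of the smallest q - d singular values of

$$\sqrt{n}\, \Bigl(\frac{Z_n^T Z_n}{n}\Bigr)^{1/2} \hat{B}_n \hat\Sigma_{x|y}^{-1/2}$$

is the same as the limiting distribution of the singular values of the
(q - d) x (p - d) matrix

$$\sqrt{n}\, \Gamma_{12}^T \Bigl(\frac{Z_n^T Z_n}{n}\Bigr)^{1/2} \hat{B}_n \hat\Sigma_{x|y}^{-1/2} \Gamma_{22}.$$

By (10), the asymptotic distribution of this matrix is

$$\sqrt{n}\, \Gamma_{12}^T \Bigl(\frac{Z_n^T Z_n}{n}\Bigr)^{1/2} \hat{B}_n \hat\Sigma_{x|y}^{-1/2} \Gamma_{22} \xrightarrow{D} N_{(p-d)(q-d)}(0, I_{p-d} \otimes I_{q-d}). \tag{13}$$

But then, $\hat\Lambda^{(d)}$ has the same asymptotic distribution as the sum of the
squares of the singular values of this matrix, which is
$\chi^2_{(p-d)(q-d)}$ by (13). $\Box$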

Note that the asymptotic test derived above is equivalent to the usual
F-test for testing d = 0 when p = 1; that is, when we fit q functions of Y
on the one-dimensional X and we test the overall validity of the model by
testing the hypothesis $\beta_{11} = \beta_{21} = \ldots = \beta_{q1} = 0$.


4.1 A summarizing theorem 


All the key results discussed and proved in the previous sections are sum- 
marized in the following theorem. 


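Theorem 3 Assume that $X_n | Y = Z_n B + E_n$, with $E(E_n) = 0$,
$\mathrm{Cov}(E_n) = \Sigma_{x|y} \otimes I_n$, where $X_n : n \times p$, $Z_n : n \times q$, $B : q \times p$, with
$\mathrm{rank}(Z_n) = q$. Let $\hat{B}_n = (Z_n^T Z_n)^{-1} Z_n^T X_n$ be the ordinary least squares
estimate of B.

Let $\hat\Sigma_{x|y}$ be a consistent estimate of $\Sigma_{x|y}$ and $G_n^{-1} = Z_n^T Z_n / n$. Assume that
$G_n \to G$ pointwise, where G is a q x q positive definite matrix. Then,

$$\sqrt{n}\, \bigl(\hat\Sigma_{x|y}^{-1/2} \otimes G_n^{-1/2}\bigr) (\hat{B}_n - B) \xrightarrow{D} N_{pq}(0, I_p \otimes I_q).$$

Let $d = \mathrm{rank}(B)$. Also, let $\hat\lambda_j$, $j = 1, \ldots, \min(q, p)$, be the singular values
of $G_n^{-1/2} \hat{B}_n \hat\Sigma_{x|y}^{-1/2}$. Then

$$\hat\Lambda^{(d)} = n \sum_{j=d+1}^{\min(q,p)} \hat\lambda_j^2$$

is asymptotically distributed as a $\chi^2_{(q-d)(p-d)}$ random variable.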


Proof: Both results are immediate consequences of Corollary 2 and The-
orem 2. $\Box$

We can use the asymptotic distribution of $\hat\Lambda^{(d)}$ to estimate the rank
d of B, or equivalently the dimension of the subspace $S_{E(X|Y)} \subseteq S_{Y|X}$,
as follows: Fix j with $0 \le j < q$. Compare $\hat\Lambda^{(j)}$ to the quantiles of a
$\chi^2_{(q-j)(p-j)}$; if it is bigger, conclude that $d > j$; if not, conclude that $d \le j$,
and repeat the procedure.
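
The statistics needed for this procedure can be computed directly from the singular values of the standardized coefficient matrix. The sketch below is an illustration under stated assumptions: the covariance estimate uses the denominator n - q, matrix square roots are taken through eigendecompositions, and the function names are hypothetical. It returns, for each candidate dimension, the statistic and its degrees of freedom; p-values can then be read from any chi-squared table or survival function.

```python
import numpy as np

def sym_power(A, power):
    """Power of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * vals ** power) @ vecs.T

def pir_statistics(X, Z):
    """For d = 0, ..., min(p, q) - 1 return (d, statistic, df), where the
    statistic is n times the sum of the squared singular values beyond
    the d largest, and df = (p - d)(q - d)."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    n, p = X.shape
    q = Z.shape[1]
    B_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)
    resid = X - Z @ B_hat
    Sigma_hat = resid.T @ resid / (n - q)  # residual covariance estimate
    M = sym_power(Z.T @ Z / n, 0.5) @ B_hat @ sym_power(Sigma_hat, -0.5)
    sv = np.linalg.svd(M, compute_uv=False)  # singular values, descending
    return [(d, n * float(np.sum(sv[d:] ** 2)), (p - d) * (q - d))
            for d in range(min(p, q))]
```

With three predictors and a quadratic fit including a constant term (q = 3), the degrees of freedom are 9, 4 and 1 for d = 0, 1, 2.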


5 An example 


To illustrate the method, we consider the Horse Mussel Data: The data 
consist of a sample of 201 horse mussel measurements collected in the Marl- 
borough Sounds, which are located off the northeast coast of New Zealand’s 
South Island (Camden, 1989). The response variable is muscle mass M, 
the edible portion of the mussel, in grams. The quantitative predictors are 
shell width W, shell length L, in mm, and shell mass S in grams. The
actual sampling method is unknown, but we assume that the data are i.i.d.
observations from the overall mussel population. The R-code (Cook and
Weisberg, 1994) was used for the computations.

In Figure 1a a scatterplot matrix of the response, shell length, shell
width and shell mass is presented. It is evident that the linearity condition
needed for SIR to work may be violated. The transformed variables $W^{1/2}$
and $S^{1/4}$ will be used in place of W and S, respectively, so that the linearity
condition is satisfied by the regressor variables.

Theorem 1 applies to the transformed data and SIR can be used to
estimate the central dimension-reduction subspace. The results of applying
SIR to the regression of M on L, $W^{1/2}$ and $S^{1/4}$ are given in Tables 1 and
2; Table 1 contains the results when 5 slices were used and Table 2 when
20 slices were used. The rows of both tables summarize hypothesis tests of
the form d = j versus d > j. For example, the first row gives the statistic
$\hat\Lambda_0 = 154.7$ with $(p-d)(H-d-1) = (3-0)(5-1) = 12$ degrees of freedom
and a p-value of 0.000. As can be seen from the two tables, SIR gives
contradictory results: it estimates the dimension to be 1 or 2, depending
on the number of slices used.

Now, consider fitting smooth parametric curves. The scatterplot matrix
in Figure 1b suggests fitting quadratic curves on all three inverse regression
plots. The results of the analysis are given in Table 3.

The test indicates a one-dimensional structure, supporting that one linear
combination of the regressors can be sufficient to characterize the behavior
of the conditional c.d.f. of M given L, $W^{1/2}$, $S^{1/4}$:

$$0.0235275\,L + 0.003176\,W^{1/2} + 0.00541707\,S^{1/4} \tag{14}$$

The same conclusion of one-dimensional structure is also reached using
regression graphical techniques to estimate the structural dimension of
this regression problem (see Cook and Weisberg, 1994).


[Figure 1 panels: a. untransformed predictors; b. transformed predictors. Scatterplot matrices; recoverable panel labels include Shell Width and Shell Length.]

Figure 1: Scatterplots of the Mussel Data


Table 1: SIR results for H = 5

  j    $\hat\Lambda_j$    DF    p-value
  0    154.7          12    0.000
  1    14.81           6    0.022
  2    4.973           2    0.083


Table 2: SIR results for H = 20

  j    $\hat\Lambda_j$    DF    p-value
  0    177.2          45    0.000
  1    31.55          28    0.293
  2    9.877          13    0.704


Table 3: Parametric results for the Mussel data

  j    $\hat\Lambda^{(j)}$    DF    p-value
  0    502.9           9    0.000
  1    6.998           4    0.136
  2    3.8210E-27      1    1.000


6 Discussion


In order to estimate a lower bound on the dimension of the central dimension-
reduction subspace $S_{Y|X}$, the conditional expectation of the standardized
X given Y was modeled according to the linear model (3) placing relaxed
conditions on the error distribution, namely zero mean and constant co-
variance structure. The decision on what model to fit is based on data
inspection. The selected model should be a sufficiently complex model
that accommodates the data. For example, if polynomials are fitted, the
degree should be a number that provides a good fit to all inverse regression
curves.

An asymptotic $\chi^2$ test for the dimension d of $S_{E(X|Y)}$ was obtained as
a result of the asymptotic normality of the least squares estimate of B.
The estimated dimension is in fact an estimate of a lower bound for the
dimension of $S_{Y|X}$.

The d eigenvectors of the least squares estimate of B, that correspond
to its d largest eigenvalues, multiplied by $Z_n$, yield estimates of d of the
basis vectors of $S_{Y|X}$. They, in turn, can be scaled back to estimates
of basis vectors of the central dimension-reduction subspace for the non-
standardized X, by multiplication with $\hat\Sigma_x^{-1/2}$, where $\hat\Sigma_x$ is the moment
estimate of $\Sigma_x$.

These results can be extended to the non-constant covariance structure 
model, under certain conditions. In addition, a similar test has been devel- 
oped for the case where the inverse regression curves are not all of the same 
shape. This test does not have an asymptotic distribution with quantiles 
as easy to compute as these of a x7. All of the above developments can be 
found in Bura (1996). 

The technique of this article does not suffer from most of the short- 
comings of SIR and requires neither the marginal distribution of X to be 
normal nor Y to be one-dimensional. Further research is needed to assess 
the sensitivity of the method to outliers. The power of the test is also
expected to be higher due to the fitting method. 

As an aside, it is worthwhile to comment that even though the esti- 
mation procedure developed in this paper was motivated by the use of 
inverse regression as a means to reduce the dimension of a forward regres- 
sion problem, it is also a method of estimating the linear subspace spanned 
by a regression curve. In this context, if the linear subspace is estimated to 
be {0} and this cannot be attributed to symmetric dependence (see Cook
and Weisberg, 1991), we can possibly infer that the regression curve is 
intrinsically nonlinear. 
Abstract: A reduction paradigm is a theoretical framework which provides
a definition of structure for multivariate laws, and allows one to simplify their
representation and statistical analysis. The main idea is to decompose a
law as the superposition of a structural term and a noise, so that the latter
can be neglected without loss of information on the structure. When the
structural term is supported by a lower-dimensional affine subspace, an
exhaustive dimension reduction is achieved. We describe the reduction
paradigm that results from selecting white noises, and convolution as
superposition mechanism.


Key words: Multivariate structure, multivariate dimension-reduction, 
multivariate graphics. 
1 Introduction


A k-variate law is a complex object whose structure embodies both marginal 
and joint features. All those features can be translated, to some extent, 
into geometric characterizations of an iid sample from the law, meant as a 
cloud of points in $\mathbb{R}^k$. Dimension does not affect the analysis of marginal
features, but as k increases it becomes progressively harder to conceive and 
articulate the joint ones. For example, how does one conceive and articu- 
late the interdependencies among, say, 10 or 100 coordinate components? 
One is often forced to neglect high-order interactions, and/or to assume 
hierarchies among them¹. At the same time, for k > 3, the data cannot
be visualized as a whole; while graphical tools can still be used to investi-
gate low-dimensional marginals, a direct graphical investigation of the joint
features is impossible.

¹Conditional independence (see A.P. Dawid, 1979) provides a key to articulate inter-
dependencies; a very interesting representation of them through conditional independence
graphs is given by J. Whittaker, 1990.

Producing inferences in high-dimensional settings can then become com-
plicated and challenging. A large variety of inference methods is available
once strong assumptions on the nature of the law are imposed; that is,
once a model for the law is chosen (see, among others, M.L. Eaton, 1983,
R.J. Muirhead, 1982, and G.A.F. Seber, 1984). But the intuition based
on graphical preliminary exploration that should precede the utilization of
model-based methods is impaired by the conceptual and practical difficul-
ties mentioned above.

These considerations, among others, justify the quest for simplified rep-
resentations of multivariate laws, especially ones allowing a reduction in
dimension. Simplified representations are often developed targeting some
(more or less restricted) features of interest. Exhaustiveness then becomes
an issue; once a target has been chosen, the information concerning it ought
to be preserved by simplification. More generally, it ought to be clear
how the proposed simplified representation relates to the target. If ex-
haustiveness is not always guaranteed, it should be possible to state under
what assumptions on the nature of the law it is, and/or to establish to what
extent the target is preserved (with or without assumptions).

These issues are very relevant in practice; the last thirty years have
witnessed the development of a large number of graphical exploration pro-
cedures for high-dimensional data sets. Think for example of Principal
Component Analysis, Factor Analysis (see G.A.F. Seber, 1984, and refer-
ences therein), Projection Pursuit (J.H. Friedman and J.W. Tukey, 1974,
J.H. Friedman, 1987, and D. Cook, A. Buja, J. Cabrera and C. Hurley,
1995), or Grand Tours (D. Asimov, 1985, and A. Buja and D. Asimov,
1986). The theoretical rationale underlying any of these procedures can
be interpreted as a simplified representation of the multivariate law from
which the data are drawn; targets range anywhere from “variability”, to
“linear interdependence structure” (correlation among the coordinate com-
ponents), to “non-linear structure” (defined as departure from normality),
to “structure” according to some other definition. Correspondingly, many
of the critiques of these procedures can be interpreted in terms of choice of
targets, and relations between simplified representations and targets. As
we proceed, it will become clear that the simplified representation under-
lying Factor Analysis is the closest in spirit to the one we will propose. In
fact, Factor Analysis differs from the other procedures mentioned above by
its reference to a latent factor entirely embodying the correlation target.

Our focus will not be on techniques to make inference on simplified 
representations (“population” objects) based on data from a multivariate 
law, but on the theoretical premises for these techniques; that is, on how 
to define targets, and how to develop simplified representations guaranteed 
to embody them exhaustively. 

In Sections 2 and 3, we introduce the concept of reduction paradigm and
provide definitions and some key results. Section 4 concerns dimension re-
duction. We conclude with a brief summary and some remarks on inference
in Section 5. More details can be found in F. Chiaromonte, 1996.


2 The reduction paradigm 


Our analysis will be conducted at the level of laws on $\mathbb{R}^k$, and we will not
distinguish among random vectors with the same distribution. The main
idea behind a reduction paradigm is to decompose a law L on $\mathbb{R}^k$ into two
terms, one of which does not contribute to the structure (the target) and
can therefore be neglected. In other words, the aim is to represent a law
as the superposition of a structural term and a noise, or no-structure term.
Hence, the specification of a reduction paradigm relies upon

- a definition of absence of structure; that is, a choice of noises

- a choice of superposition mechanism

which, conversely, determine a definition of structure. We have selected
white noises $N_k(0, \beta I_k)$, $\beta \in \mathbb{R}_+$, and convolution. Hence, we write

$$L = \Lambda_\beta(L) * N_k(0, \beta I_k) \qquad (1)$$

or, in terms of characteristic functions

$$\varphi_L(u) = \varphi_{\Lambda_\beta(L)}(u)\, e^{-\frac{\beta}{2}\|u\|^2}, \quad u \in \mathbb{R}^k \qquad (2)$$


This is by no means the only possibility, but it is in line with much of the 
statistical tradition and thus constitutes a very natural first step. In fact, 
it expresses a situation in which an independent normal error is additively 
superimposed on the object of interest. One can envision reproducing the
whole analysis we are about to develop with different noises and/or super-
position mechanisms, though. As far as noises are concerned, one could
take, for example, uniforms on hyper-spheres of radius $\rho \in \mathbb{R}_+$, or normals
with independent components $N_k(0, \mathrm{Diag}(\sigma))$, $\sigma \in \mathbb{R}^k_+$. In the first case
one maintains the weakly spherical nature of white noises and loses inde- 
pendence of the coordinate components, while in the second case one loses 
weak sphericity and maintains independence. Regarding superposition, one 
could explore, for example, multiplicative (instead of additive) schemes. 
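
The additive superposition in (1) can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: the structural term (a uniform law on a segment of the plane), the sample size and the value of beta are all hypothetical choices made for illustration. It draws from the convolution by adding independent white noise, and checks that the covariances add.

```python
import numpy as np

def superpose_white_noise(structural_sample, beta, rng):
    """Draw from Lambda * N_k(0, beta I_k), given draws from a structural term Lambda."""
    n, k = structural_sample.shape
    noise = rng.normal(scale=np.sqrt(beta), size=(n, k))  # independent white noise
    return structural_sample + noise

rng = np.random.default_rng(0)
n, beta = 200_000, 0.5  # hypothetical sample size and noise variance

# Hypothetical structural term: uniform on a segment along the first axis of R^2.
structure = np.column_stack([rng.uniform(-1.0, 1.0, n), np.zeros(n)])
sample = superpose_white_noise(structure, beta, rng)

cov_structure = np.cov(structure, rowvar=False)
cov_sample = np.cov(sample, rowvar=False)

# Convolution with independent noise adds covariances: Cov(L) = Cov(Lambda) + beta I_k.
cov_difference = cov_sample - cov_structure
print(np.round(cov_difference, 2))
```

The structural term here is supported by a one-dimensional subspace, while the noisy law has full-dimensional support; this is the situation in which neglecting the noise yields a drop in dimension.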
Before proceeding let us remark that the reduction paradigm we have
selected, as well as any other conceivable one, while certainly constituting
a model for decomposing a law, does not require strong assumptions on the
nature of the law itself. A reduction paradigm can be applied without fixing
at the outset a model for the law; that is, without assuming at the outset
that the law belongs to a given (and possibly finitely parameterized) class.
Furthermore, our reduction paradigm corresponds to the inverse problem 
for heat-type diffusion of probability measures (for an easy introduction, 
see G.M. Wing, 1991, and A. Friedman and W. Littman, 1994). Paradigms 
resulting from a different choice of noises would correspond to inverse prob- 
lems for processes with different kernels. 

Indexing the structural term by $\beta$ serves to stress the fact that the
decomposition in (1) and (2) is not unique, unless it holds only with $\beta = 0$,
and therefore $\Lambda_\beta(L) = L$ itself. The set

$$B(L) = \{\beta \in \mathbb{R}_+ \ \text{s.t.}\ \varphi_L(\cdot)\, e^{\frac{\beta}{2}\|\cdot\|^2} \ \text{is a ch. fct.}\} \subseteq \mathbb{R}_+$$

expresses the range of possible decompositions. It is always non-empty,
as it must contain 0, and is easily shown to be $B(L) = [0, \beta_o(L)]$, where
$\beta_o(L) = \sup B(L) = \max B(L)$. We call the corresponding structural terms

$$\Lambda_\beta(L) \leftrightarrow \varphi_L(\cdot)\, e^{\frac{\beta}{2}\|\cdot\|^2}, \quad \beta \in B(L)$$

sources, $\beta_o(L)$ reduction coefficient, and $\Lambda_o(L) \leftrightarrow \varphi_L(\cdot)\, e^{\frac{\beta_o(L)}{2}\|\cdot\|^2}$ primary
source of L. Notice that reduction coefficient and primary source are unique
by construction. If $\beta_o(L) = 0$, so that the only (and thus primary) source
of L is L itself, we say that the law is irreducible. We call it reducible
otherwise.

All sources share the structure of L, and can be equivalently taken as
exhaustive “simplified” representations of the law. The primary source is
the one in which no error is superimposed on the structure; that is, the one
in which we have pushed simplification as far as possible. Hence, we will
select $\Lambda_o(L)$ as simplified representation of L, and write

$$L = \Lambda_o(L) * N_k(0, \beta_o(L) I_k)$$

We can fix ideas using the normal case as an example. Here and in the fol-
lowing, $P_{(\cdot)}$ indicates the orthogonal projection operator onto the argument
subspace², with respect to the standard inner product on $\mathbb{R}^k$. Let

$$L = N_k\Big(\mu, \sum_{j=1}^{p} \eta_j P_{V_j} + \eta P_V\Big)$$

where $\sum_{j=1}^{p} \eta_j P_{V_j} + \eta P_V$ is the spectral decomposition of the covariance with
(distinct) eigenvalues $\eta_1, \ldots, \eta_p, \eta$ in decreasing order, and corresponding
eigenspaces $V_1, \ldots, V_p, V$. It is easy to show that

$$\varphi_L(u)\, e^{\frac{\beta}{2}\|u\|^2} = \exp\Big\{ i u'\mu - \frac{1}{2} u' \Big( \sum_{j=1}^{p} (\eta_j - \beta) P_{V_j} + (\eta - \beta) P_V \Big) u \Big\}$$

is a characteristic function if and only if $\sum_{j=1}^{p} (\eta_j - \beta) P_{V_j} + (\eta - \beta) P_V$ is
non-negative definite; that is, if and only if $\beta \le \eta$. Hence, $\beta_o(L) = \eta$ and
correspondingly

$$\Lambda_o(L) = N_k\Big(\mu, \sum_{j=1}^{p} (\eta_j - \eta) P_{V_j}\Big)$$

It is then clear that a normal is irreducible if and only if the smallest
eigenvalue of its covariance is 0; the irreducible k-variate normals are all
and only the ones supported by lower dimensional affine subspaces, and
they constitute the primary sources of non-singular normals.

²The reference, throughout our discussion, is to linear subspaces, and affine subspaces
obtained by translating them.
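
For the normal case just described, the reduction coefficient and the primary source can be computed directly from the covariance. The sketch below is an illustration under that normality assumption only; the function name and the example covariance matrix are hypothetical.

```python
import numpy as np

def normal_primary_source(mean, cov):
    """For L = N_k(mean, cov), return (beta_o, mean, covariance of the primary source).

    Under normality the reduction coefficient equals the smallest eigenvalue
    of the covariance, and the primary source has covariance cov - beta_o * I_k.
    """
    cov = np.asarray(cov, dtype=float)
    eigenvalues = np.linalg.eigvalsh(cov)       # ascending order
    beta_o = max(float(eigenvalues[0]), 0.0)    # guard against tiny negative round-off
    source_cov = cov - beta_o * np.eye(cov.shape[0])
    return beta_o, np.asarray(mean, dtype=float), source_cov

# Hypothetical covariance with eigenvalues 4, 2, 2 (smallest has multiplicity 2).
cov = np.array([[3.0, 1.0, 0.0],
                [1.0, 3.0, 0.0],
                [0.0, 0.0, 2.0]])
beta_o, mu, source_cov = normal_primary_source(np.zeros(3), cov)

# The primary source is a singular normal; its rank is k minus the multiplicity
# of the smallest eigenvalue.
source_rank = np.linalg.matrix_rank(source_cov, tol=1e-8)
print(beta_o, source_rank)
```

In this example the primary source is supported by a line in three-dimensional space, consistent with the statement that primary sources of non-singular normals are singular normals.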

Primary sources are irreducible by construction. The class of all irre-
ducible laws on $\mathbb{R}^k$ represents the repertoire of possible structures. The fol-
lowing proposition provides a sufficient condition for irreducibility, thereby
characterizing part of such repertoire.


Proposition 1 If there exists a measurable set $B \subset \mathbb{R}^k$ such that Leb(B)
> 0, but L(B) = 0, then L is irreducible.


Proof: Suppose $\beta_o(L) > 0$. Then, for any choice of $\nu \in \mathbb{R}^k$, $N_k(\nu, \beta_o(L) I_k)$
is mutually absolutely continuous with respect to Leb. So Leb(B) > 0
implies

$$N_k(\nu, \beta_o(L) I_k)(B) > 0, \quad \forall \nu \in \mathbb{R}^k$$

and thus

$$L(B) = \int_{\mathbb{R}^k} N_k(\nu, \beta_o(L) I_k)(B)\, \Lambda_o(L)(d\nu) > 0$$

contradicting our assumption. We can conclude that $\beta_o(L) = 0$, and there-
fore that L is irreducible. □

Since we have selected white noises as no-structure terms, reducible
laws must be mutually absolutely continuous with respect to the Lebesgue
measure, because they “contain” a term that is. As a consequence, all laws
having “thick” holes with respect to the Lebesgue measure are irreducible
in $\mathbb{R}^k$. In particular, laws whose affine support As(L) has dimension < k
are irreducible in $\mathbb{R}^k$; we saw an instance of this with irreducible normals.
So are laws whose closed support Cs(L) is bounded, regardless of whether
the latter is full-dimensional or embedded in a subspace or affine subspace
of dimension < k.

Notice that existence of an everywhere positive density is not enough to 
guarantee reducibility; again because of our choice of no-structure terms, 
reducible laws’ densities must have “thick enough” tails. It is easy to show 
that a law with an everywhere positive density whose tails vanish too fast, 
at least along some directions, will still be irreducible (see F. Chiaromonte, 
1996). 


3 Some affine actions, and marginalizations 


We will now explore the effects on reduction of some affine actions and of 
marginalizations. 


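Proposition 2 Let $T_{\nu,r,R}[L]$ be the law of $rRX - \nu$, where $X \in \mathbb{R}^k$ is
any random vector distributed according to L, $\nu \in \mathbb{R}^k$, $r \in \mathbb{R}_+$, and R is a
rotation of $\mathbb{R}^k$. Then $\beta_o(T_{\nu,r,R}[L]) = r^2 \beta_o(L)$ and $\Lambda_o(T_{\nu,r,R}[L]) = T_{\nu,r,R}[\Lambda_o(L)]$.


Proof: For r = 0, the transformation yields a point-mass at $-\nu$, and the
statement is trivially true. Otherwise, using characteristic functions, one
has

$$\varphi_{T_{\nu,r,R}[L]}(u) = e^{-iu'\nu}\, \varphi_L(rR'u)$$
$$= e^{-iu'\nu}\, \varphi_{\Lambda_o(L)}(rR'u)\, e^{-\frac{\beta_o(L)}{2}\|rR'u\|^2}$$
$$= \varphi_{T_{\nu,r,R}[\Lambda_o(L)]}(u)\, e^{-\frac{r^2 \beta_o(L)}{2}\|u\|^2}$$

so $\beta_o(T_{\nu,r,R}[L]) \ge r^2 \beta_o(L)$, and $T_{\nu,r,R}[\Lambda_o(L)]$ is a source of $T_{\nu,r,R}[L]$. But for
$r \ne 0$ our transformation is invertible: $T^{-1}_{\nu,r,R}[\cdot] = T_{-\nu,1/r,R'}[\cdot]$. Hence

$$\varphi_L(u) = \varphi_{T_{-\nu,1/r,R'}[T_{\nu,r,R}[L]]}(u)$$
$$= \varphi_{T_{-\nu,1/r,R'}[\Lambda_o(T_{\nu,r,R}[L])]}(u)\, e^{-\frac{(1/r)^2 \beta_o(T_{\nu,r,R}[L])}{2}\|u\|^2}$$

and $\beta_o(L) \ge (1/r)^2 \beta_o(T_{\nu,r,R}[L])$. We can conclude that $\beta_o(T_{\nu,r,R}[L]) =
r^2 \beta_o(L)$, and therefore that $T_{\nu,r,R}[\Lambda_o(L)]$ is indeed the primary source of
$T_{\nu,r,R}[L]$. □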

The reduction coefficient is not affected by rotations and translations,
and is multiplied by the square of a rescaling factor. Thus, rescalings,
rotations and translations of L result in corresponding rescalings, rota-
tions and translations of the primary source. In the following, we will use
interchangeably the terms marginalization and projection. Besides the in-
tuitive correspondence, “invariance” under rotations makes this rigorous;
the choice of orthonormal basis does not matter.
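
The behaviour under rescalings and rotations can be checked in the normal case, where the reduction coefficient is the smallest eigenvalue of the covariance. This is only a sketch under that assumption: the covariance, the scale factor r and the rotation Q are randomly generated, hypothetical inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
k, r = 4, 2.5  # hypothetical dimension and rescaling factor

# Covariance of a hypothetical non-singular normal law L.
A = rng.normal(size=(k, k))
cov = A @ A.T + 0.3 * np.eye(k)

# Random orthogonal matrix, with a column flipped if needed so that det = +1 (a rotation).
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1.0

# Covariance of r * Q * X - nu; the translation nu does not affect it.
cov_T = (r * Q) @ cov @ (r * Q).T

beta_L = np.linalg.eigvalsh(cov)[0]    # reduction coefficient of L (normal case)
beta_T = np.linalg.eigvalsh(cov_T)[0]  # reduction coefficient of the transformed law

# Covariance of the primary source, before and after the transformation.
source_cov = cov - beta_L * np.eye(k)
source_cov_T = cov_T - beta_T * np.eye(k)
print(beta_T, r**2 * beta_L)
```

Both the reduction coefficient and the covariance of the primary source transform as Proposition 2 states: the former is multiplied by r squared, the latter is rotated and rescaled in the same way as the covariance of L.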

In our discussion so far, we have considered the reduction of a law L on
$\mathbb{R}^k$ in $\mathbb{R}^k$. The reference to the space is important; laws on $\mathbb{R}^k$ that are
entirely concentrated on some subspace can also be meant as laws on such
subspace, and reducing them within the subspace can produce a different set
of sources, a different reduction coefficient and a different primary source.
Laws that are entirely concentrated on a subspace of dimension < k are
irreducible in $\mathbb{R}^k$, but they might still be reducible within the subspace.

The noises within a given subspace $S \subseteq \mathbb{R}^k$ are represented by $N_k(0, \beta P_S)$,
$\beta \in \mathbb{R}_+$. Notation-wise, when considering the reduction of a law L (entirely
concentrated on S) within S, we will write $\beta_o(L, S)$, $\Lambda_o(L, S)$, etc.


Proposition 3 Let $M_S[L]$ be the law of $P_S X$, where $X \in \mathbb{R}^k$ is any ran-
dom vector distributed according to L, and $S \subseteq \mathbb{R}^k$ is a non-degenerate
subspace. Then $\beta_o(M_S[L], S) \ge \beta_o(L)$ and

$$\Lambda_o(M_S[L], S) * N_k(0, \alpha P_S) = M_S[\Lambda_o(L)]$$

where $\alpha = \beta_o(M_S[L], S) - \beta_o(L)$. In particular, if $Cs(\Lambda_o(L))$ is bounded,
$\beta_o(M_S[L], S) = \beta_o(L)$ and $\Lambda_o(M_S[L], S) = M_S[\Lambda_o(L)]$.


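Proof: Using characteristic functions, one has

$$\varphi_{M_S[L]}(u) = \varphi_L(P_S u)$$
$$= \varphi_{\Lambda_o(L)}(P_S u)\, e^{-\frac{\beta_o(L)}{2}\|P_S u\|^2}$$
$$= \varphi_{M_S[\Lambda_o(L)]}(u)\, e^{-\frac{\beta_o(L)}{2}\|P_S u\|^2}$$

so $\beta_o(M_S[L], S) \ge \beta_o(L)$, and $M_S[\Lambda_o(L)]$ is a source of $M_S[L]$ within S.
Equating the right hand side above with the right hand side of

$$\varphi_{M_S[L]}(u) = \varphi_{\Lambda_o(M_S[L], S)}(u)\, e^{-\frac{\beta_o(M_S[L], S)}{2}\|P_S u\|^2}$$

we obtain

$$\varphi_{\Lambda_o(M_S[L], S)}(u)\, e^{-\frac{\alpha}{2}\|P_S u\|^2} = \varphi_{M_S[\Lambda_o(L)]}(u)$$

where $\alpha = \beta_o(M_S[L], S) - \beta_o(L)$; that is

$$\Lambda_o(M_S[L], S) * N_k(0, \alpha P_S) = M_S[\Lambda_o(L)]$$

Now, assume that $Cs(\Lambda_o(L))$ is bounded. We need to show that this implies
$\alpha = 0$. Suppose $\alpha > 0$. Then, because of the normal term

$$Cs(M_S[\Lambda_o(L)]) = Cs(\Lambda_o(M_S[L], S) * N_k(0, \alpha P_S)) = S$$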


236 Francesca Chiaromonte 


But if $Cs(M_S[\Lambda_o(L)])$ is unbounded $Cs(\Lambda_o(L))$ must be unbounded, too,
contradicting our assumption. We can conclude that $\beta_o(M_S[L], S) = \beta_o(L)$,
and therefore that $\Lambda_o(M_S[L], S) = M_S[\Lambda_o(L)]$. □

The reduction coefficient (within S) of the marginal of L must be greater
than or equal to $\beta_o(L)$. Correspondingly, the marginal of $\Lambda_o(L)$ is a source
(within S) of the marginal of L, even though not necessarily the primary
one. Under the assumption that $Cs(\Lambda_o(L))$ is bounded, the reduction coef-
ficients coincide. Thus, the marginal of $\Lambda_o(L)$ is indeed the primary source
(within S) of the marginal of L. In other words, under the boundedness
assumption the reduction coefficient is not affected by marginalizations
(projections), and therefore marginalizations of L result in correspond-
ing marginalizations of the primary source.


4 The structural subspace, and exhaustive 
dimension reduction 


The affine support of $\Lambda_o(L)$ represents the smallest affine subspace sup-
porting the structure of L, as defined by our reduction paradigm. We
call the subspace underlying it the structural subspace of the law $S_o(L) =
As(T_\nu[\Lambda_o(L)])$, where $\nu$ is any element of $Cs(\Lambda_o(L))$, and $T_\nu$ stands for $T_{\nu,1,I_k}$.
Correspondingly, we call $d_o(L) = \dim(S_o(L))$ the structural dimension.
Whenever $d_o(L) < k$, our (exhaustive) simplified representation of L im-
plies a drop in dimension.

This allows us to define an exhaustive dimension reduction. Let us see
how. Suppose we know $\nu \in Cs(\Lambda_o(L))$. Then, the exercise of identifying
$\Lambda_o(L)$ is equivalent to that of identifying $\Lambda_o(T_\nu[L])$. In fact, by Proposition 2

$$\Lambda_o(L) = T_{-\nu} T_\nu[\Lambda_o(L)] = T_{-\nu}[\Lambda_o(T_\nu[L])]$$

Now, suppose $S_o(L)$ is known, too. Then, we can marginalize $T_\nu[L]$ to $S_o(L)$
preserving all the information relative to the structure, as defined by our
reduction paradigm. In fact, again by Proposition 2

$$S_o(L) = As(T_\nu[\Lambda_o(L)]) = As(\Lambda_o(T_\nu[L]))$$

so that indeed $\Lambda_o(T_\nu[L])$ is supported by the structural subspace³, and

$$\Lambda_o(T_\nu[L]) = M_{S_o(L)}[\Lambda_o(T_\nu[L])]$$

But then, by Proposition 3

$$\Lambda_o(T_\nu[L]) = \Lambda_o(M_{S_o(L)} T_\nu[L], S_o(L)) * N_k(0, \alpha P_{S_o(L)})$$

so $\Lambda_o(T_\nu[L])$ is a source for $M_{S_o(L)} T_\nu[L]$ within $S_o(L)$. Furthermore, if we can
assume $Cs(\Lambda_o(L))$, and therefore $Cs(\Lambda_o(T_\nu[L]))$, to be bounded

$$\Lambda_o(T_\nu[L]) = \Lambda_o(M_{S_o(L)} T_\nu[L], S_o(L))$$

that is, $\Lambda_o(T_\nu[L])$ is the primary source of $M_{S_o(L)} T_\nu[L]$ within $S_o(L)$. This
gives an even stronger meaning to the exhaustiveness of our marginaliza-
tion; not only is no structural information lost marginalizing $T_\nu[L]$, but the
exercise of identifying $\Lambda_o(T_\nu[L])$ (to be performed in k dimensions) would
actually correspond to that of identifying $\Lambda_o(M_{S_o(L)} T_\nu[L], S_o(L))$ (to be
performed in a possibly smaller dimension).

³Notice that, by Proposition 2, the structural subspace is invariant under translations
of L: $S_o(T_\nu[L]) = S_o(L)$. When translating by an element of $Cs(\Lambda_o(L))$ we obtain a law
which is actually supported by the subspace itself, instead of an affine subspace parallel
to it.

The question then becomes how to identify translation term and struc-
tural subspace. Clearly, existence of finite moments of a certain order for
L implies that of the corresponding moments for $\Lambda_o(L)$. If L admits finite
first order moments, $E(\Lambda_o(L)) = E(L)$, and one can take as translation term
$\nu = E(L) \in Cs(\Lambda_o(L))$. Furthermore, if L admits finite second order mo-
ments, structural subspace and structural dimension can be related to the
spectral decomposition of the covariance. We will need the following


Lemma 1 If L admits finite second order moments, then $As(T_{E(L)}[L]) =
\mathrm{Span}(\mathrm{Cov}(L))$.


Proof: Consider the orthogonal complement of Span(Cov(L)) with respect
to the standard inner product, and $M_{\mathrm{Span}(\mathrm{Cov}(L))^\perp} T_{E(L)}[L]$. Easy calcula-
tions give

$$E(M_{\mathrm{Span}(\mathrm{Cov}(L))^\perp} T_{E(L)}[L]) = 0, \quad \mathrm{Cov}(M_{\mathrm{Span}(\mathrm{Cov}(L))^\perp} T_{E(L)}[L]) = 0$$

Thus, $M_{\mathrm{Span}(\mathrm{Cov}(L))^\perp} T_{E(L)}[L]$ is a point-mass at 0, which implies
$\mathrm{Span}(\mathrm{Cov}(L)) \supseteq As(T_{E(L)}[L])$. On the other hand, using the definition of co-
variance we have

$$\mathrm{Cov}(L) z = \mathrm{Cov}(T_{E(L)}[L]) z = \int_{\mathbb{R}^k} y\, (y'z)\, T_{E(L)}[L](dy)$$

which implies $\mathrm{Span}(\mathrm{Cov}(L)) \subseteq As(T_{E(L)}[L])$. The statement follows. □
Now, denoting by Ind(-) the indicator function of the argument condi- 
tion, we have 




Proposition 4 Suppose L admits finite second order moments. Let $\eta(L)$
be the smallest eigenvalue of Cov(L), and V(L) the corresponding eigenspace.
Then $\beta_o(L) \le \eta(L)$ and

$$S_o(L) = V(L)^\perp \oplus \mathrm{Ind}(\eta(L) - \beta_o(L) > 0)\, V(L)$$

with $d_o(L) = k - \mathrm{Ind}(\eta(L) - \beta_o(L) = 0) \dim(V(L))$.


Proof: Writing $\mathrm{Cov}(L) = \sum_{j=1}^{p} \eta_j(L) P_{V_j(L)} + \eta(L) P_{V(L)}$ one has

$$\mathrm{Cov}(\Lambda_o(L)) = \mathrm{Cov}(L) - \beta_o(L) I_k = \sum_{j=1}^{p} (\eta_j(L) - \beta_o(L)) P_{V_j(L)} + (\eta(L) - \beta_o(L)) P_{V(L)}$$

But then $\beta_o(L) \le \eta(L)$ is implied by non-negative definiteness of $\mathrm{Cov}(\Lambda_o(L))$,
and the expression for the structural subspace follows from Lemma 1 ap-
plied to $\Lambda_o(L)$. A drop in dimension occurs if and only if $\beta_o(L) = \eta(L)$,
and when it occurs $d_o(L) = k - \dim(V(L))$, where dim(V(L)) represents
the multiplicity of $\eta(L)$. □

Given the spectral decomposition of Cov(L), the above proposition pro-
vides an upper bound for the reduction coefficient and a lower bound for
the structural subspace; namely, the smallest eigenvalue of Cov(L) and the
orthogonal complement of its eigenspace. The spectral decomposition of
Cov(L) is not enough to identify the structural subspace, though; we still
need to know whether the reduction coefficient is strictly smaller than, or
equal to, the smallest eigenvalue.

Remember that for a normal law $\beta_o(L) = \eta$. Thus, under normality the
drop in dimension always occurs, and one has $S_o(L) = V^\perp$ with $d_o(L) =
k - \dim(V) \le k - 1$. It is important to remark that coincidence of $\beta_o(L)$ with
$\eta(L)$ (and therefore the drop in dimension) is not guaranteed in general.
Identifying the reduction coefficient with the smallest eigenvalue of the
covariance can actually be very misleading. Take for example a “noisy”
uniform on a hyper-cube $L = \mathrm{Un}([-\theta, \theta]^k) * N_k(0, \tau I_k)$, $\theta, \tau \in \mathbb{R}_+ \setminus \{0\}$.
For such a law one has $\beta_o(L) = \tau < \theta^2/3 + \tau = \eta(L)$ and $S_o(L) = \mathbb{R}^k
\supset \{0\} = V(L)^\perp$, as the multiplicity of $\theta^2/3 + \tau$ is k.
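
The noisy-uniform example can be simulated to see how far the smallest eigenvalue of the covariance sits above the reduction coefficient. This is a minimal sketch with hypothetical values of theta, tau, k and the sample size; it uses the fact that a uniform on [-theta, theta] has variance theta^2 / 3.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200_000, 3          # hypothetical sample size and dimension
theta, tau = 1.5, 0.2      # hypothetical half-width of the cube and noise variance

uniform_part = rng.uniform(-theta, theta, size=(n, k))   # bounded structural term
noise = rng.normal(scale=np.sqrt(tau), size=(n, k))      # white noise N_k(0, tau I_k)
sample = uniform_part + noise

# Smallest eigenvalue of the sample covariance estimates eta(L) = theta^2 / 3 + tau.
eta_hat = np.linalg.eigvalsh(np.cov(sample, rowvar=False))[0]
eta_theory = theta**2 / 3 + tau

print(eta_hat, eta_theory, tau)
```

The estimate of the smallest eigenvalue is close to theta^2 / 3 + tau, well above the reduction coefficient tau, so using it in place of the reduction coefficient would remove part of the structural term along with the noise.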


5 A brief summary with some remarks on 
inference 


The ultimate aim within the framework defined by a reduction paradigm is
that of making inference about the (unobservable) $\Lambda_o(L)$, which constitutes
our simplified and yet exhaustive representation of the original law. As we
have seen, if we can assume that L is normal, $\beta_o(L) = \eta$ and

$$\Lambda_o(L) = N_k\Big(\mu, \sum_{j=1}^{p} (\eta_j - \eta) P_{V_j}\Big)$$

which is entirely identified through the mean vector and the spectral decom-
position of the covariance. Hence, if the data are consistent with normality,
we could estimate the primary source based on estimates of those. Also, if
the data can be transformed to approximate normality, the primary source
could be estimated on the transformed scale. What can we do when the
data contradict normality on the original scale, and fail to approximate
it even after applying normalizing transformations?

An intermediate aim is constituted by estimating a $\nu \in Cs(\Lambda_o(L))$ and
$S_o(L)$. Besides the intrinsic interest, if indeed our simplified representa-
tion implied a drop in dimension, having such estimates would allow us to
perform an exhaustive dimension reduction.

Given the results in the previous sections, we are clearly at an advantage
if we are willing to assume boundedness of $Cs(\Lambda_o(L))$. Since the latter
implies existence and finiteness for all the moments of L, we would have
$E(L) \in Cs(\Lambda_o(L))$ and (Proposition 4)

$$S_o(L) = V(L)^\perp \oplus \mathrm{Ind}(\eta(L) - \beta_o(L) > 0)\, V(L)$$

Furthermore, we could restrict inference on the reduction coefficient to
any arbitrarily small non-degenerate subspace. In fact, by Proposition 3
$\beta_o(L) = \beta_o(M_t[L], t)$, where t is any line in $\mathbb{R}^k$. Thus, we could take E(L)
as translation term, and produce an estimate of the structural subspace
based on $\eta(L)$, V(L) and $\beta_o(M_t[L], t)$. Methods to estimate E(L), and, less
trivially, $\eta(L)$ and V(L), exist in the literature and are not affected by how
large k is (see M.L. Eaton and D. Tyler, 1994, and E. Bura, 1996).

As a matter of fact, in order to produce an estimate of the structural
subspace we would only have to assess, selecting for example $t \subseteq V$, whether
$\beta_o(M_t[L], t)$ is strictly smaller than, or coincides with $\mathrm{var}(M_t[L]) = \eta(L)$.
This, in turn, is equivalent to assessing whether $M_t[L]$ is a 1-dimensional
normal.

Under the assumption that $Cs(\Lambda_o(L))$ is bounded, we also have that

$$\Lambda_o(L) = T_{-E(L)}[\Lambda_o(M_{S_o(L)} T_{E(L)}[L], S_o(L))]$$

Hence, we could center the data cloud translating it by E(L), and restrict
any further analysis to the projection of the centered cloud onto $S_o(L)$;
all the structural features (except for location, which is captured by E(L))
would be preserved. If indeed $d_o(L) = \dim(S_o(L)) < k$, we would have
achieved an exhaustive dimension reduction.
Abstract: Partial residual plots are one of the most useful graphical pro-
cedures in the exploratory fitting of data sets. They are frequently used 
in the identification of unknown functions, g, of predictor variables. Tra- 
ditionally these plots have been based on least squares (LS) fitting. It 
is well known that LS estimates are sensitive to outlying observations. 
The examples and sensitivity study discussed in this paper show that 
this vulnerability to outliers carries over to the LS based partial residual 
plots. A few outliers in the data set can distort the LS partial residual 
plot making the identification of g impossible. Furthermore, if g is non- 
linear, good data points may act as outliers and cause distortion in the 
plot. Partial residual plots based on highly efficient robust estimates are 
presented. In the simulated data sets explored in this paper, the robust 
based partial residual plots are insensitive to the outlying observations 
leading to a much easier identification of the unknown functions than 
their LS counterparts. In the sensitivity study presented, these robust 
based partial residual plots do not become distorted in the presence of 
outliers but they maintain their focus, enabling the identification of g. 


Key words: Linear model, M-estimates, outlier, regression diagnostics, 
R-estimates, rank based methods. 
1 Introduction


Partial residual plots are one of the most useful graphical procedures in the
exploratory fitting of data sets. These plots are quite simple. Consider a
model of the form $y_i = \alpha + \beta_1 x_{i1} + g(x_{i2}) + e_i$, where the function g(x) is
unknown. Then (the first-order) partial residuals are the residuals of the
fit of the misspecified model $y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + e_i$ added to the fitted
part $\hat{\beta}_2 x_{i2}$. The plot consists of these partial residuals plotted versus $x_{i2}$.
This plot is often informative in the identification of the unknown function
g(x).
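
The construction of first-order partial residuals can be sketched in a few lines. The example below uses an ordinary least squares fit; the function name, the simulated data and the choice g(x) = x^2 are hypothetical and serve only to illustrate the definition.

```python
import numpy as np

def ls_partial_residuals(y, x1, x2):
    """First-order partial residuals for x2 from the LS fit of y on (1, x1, x2)."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    # Add the fitted part for x2 back onto the residuals.
    return residuals + coef[2] * x2

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = rng.uniform(-2.0, 2.0, size=n)
# Hypothetical true model with g(x) = x^2, which the linear fit misspecifies.
y = 1.0 + 2.0 * x1 + x2**2 + rng.normal(scale=0.1, size=n)

partial_resid = ls_partial_residuals(y, x1, x2)
# Plotting partial_resid against x2 would show the quadratic shape of g.
curvature_corr = np.corrcoef(partial_resid, x2**2)[0, 1]
print(round(curvature_corr, 3))
```

A scatter plot of `partial_resid` against `x2` is the partial residual plot; with clean data the quadratic shape is recovered up to a vertical shift.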

Partial residual plots were proposed by Ezekiel (1924) and have been dis- 
cussed by numerous authors. Larsen and McCleary (1972) gave the name 
partial residual plot to this procedure. Mallows (1986) extended these 
first-order plots to higher orders, the so-called augmented partial residual 
plots; Mansfield and Conerly (1987) considered informative algebraic rep- 
resentations of partial residuals; Cook (1993) obtained further theoretical 
underpinnings of these plots and proposed an extended class, the CERES 
plots; and Berk and Booth (1995) compared partial residual plots with sev-
eral other diagnostic plots in a series of interesting examples. Based on 
work such as this, partial residual plots have become an important tool in 
data exploration. 

Most of the discussion of partial residual plots is based on the traditional 
least squares (LS) fitting of models. Partial residuals, though, are simply 
residuals added to the fit of the misspecified part. Hence, fits other than 
LS can be considered. McKean and Sheather (1997) developed properties 
of partial residuals based on robust fitting. They showed that the expected 
behavior of the resulting robust partial residual plots was similar to that 
of the LS partial residual plots. Furthermore, they showed that the robust 
partial residual plots were not as sensitive to outliers as the LS based plots. 

To determine which robust estimates to use, note that the function g(x) 
is often a nonlinear function. Hence the employed fitting criteria should be 
able to detect and fit curvature. We have selected highly efficient M and 
R estimates as the basis of our fitting criteria. These estimates and their
residuals have been shown to behave similarly to their LS counterparts in de-
tecting and fitting curvature on good data, while being much less sensitive
than LS procedures on data containing outliers in the Y-space; see McKean,
Sheather and Hettmansperger (1990, 1993, and 1994). These fitting criteria
are based on minimizing convex functions; hence, the consistency theory 
developed by Cook (1993) for LS partial residual plots extends to these ro- 
bust partial residual plots. Also these highly efficient robust fitting criteria 
are computationally fast and available. 

In this paper, we explore several data sets using robust based partial 
residual plots. In many of these data sets, outliers distort the LS based 
plots to the point where the identification of the unknown function g(x) is 
impossible. In one of the data sets, due to the nonlinearity of the function
g, good data acted as outliers and distorted the LS based partial residual 
plot. The robust based partial residual plots, though, are not sensitive to 
the effect of the outliers. These plots clearly identify the unknown function 
g. The sensitivity study in Section 5 shows the distortion of the LS based 
partial residual plot in a sequential fashion as a few points become increas- 
ingly outlying. The robust based partial residual plots, however, retain 
their “focus” under the increasing influence of the outliers. 


2 Notation 


This paper considers partial residual plots based on robust estimates. As 
discussed in Section 3, these plots are often used to graphically determine 
unknown functions of predictors. These functions are often nonlinear so fit- 
ting procedures which can detect curvature are of interest. Studies by Cook, 
Hawkins and Weisberg (1992) and McKean, Sheather and Hettmansper- 
ger (1993, 1994) have shown that high breakdown and bounded influence 
estimates have problems in detecting and fitting curvature, while highly 
efficient robust estimates are capable of detecting and fitting curvature. 
Hence, in this article we will focus on highly efficient robust estimates. To 
keep things simple, we have chosen the Huber M estimate and the Wilcoxon 
R estimate. Both of these estimates are widely available. But clearly other 
robust estimates (other $\psi$-functions and other score functions) can be used
and will produce similar results. Similar to LS-estimates, though, the Huber 
and Wilcoxon estimates are highly sensitive to outliers in the x-space. This 
should be considered in exploring any data set prone to outliers in factor 
space. McKean, Naranjo and Sheather (1996a, 1996b) discuss diagnostic 
procedures that measure the overall difference between highly efficient and 
high breakdown robust estimates and determine cases where the fits differ. 

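Consider the linear regression model $y_i = \alpha + x_i'\beta + e_i$, $i = 1, \ldots, n$,
where $x_i'$ is the ith row of the n x p centered matrix X of explanatory
variables. The least squares estimates $\hat{\alpha}_{LS}$ and $\hat{\beta}_{LS}$ minimize
the dispersion

$$D_{LS}(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - x_i'\beta)^2 . \qquad (1)$$

Let $\sigma^2$ be the common error variance. Under regularity conditions, the
asymptotic distribution of the LS estimates is given by

$$\begin{pmatrix} \hat{\alpha}_{LS} \\ \hat{\beta}_{LS} \end{pmatrix} \;\dot\sim\; N_{p+1}\left( \begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \ \sigma^2 \begin{bmatrix} 1/n & 0' \\ 0 & (X'X)^{-1} \end{bmatrix} \right) . \qquad (2)$$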
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The regular (Wilcoxon) R-estimate By minimizes the dispersion 


Dw(B) = Ya(R(y: — x18) (v: — x,8) , 3) 


i=l 


where R(y; — x;3) is the rank of yj — x’, among y1 — x}G,..-,Yn — X,8 
and the scores a(i) are generated by the linear function 


elu) = VIB (u-5) , (4) 


as a(t) = y(t/(n+1)). Although we will be using Wilcoxon scores through- 
out this paper, the y notation will be useful. The function (3) is a convex 
function of 8 and Gauss-Newton type algorithms suffice for the minimza- 
tion; see Kapenga, McKean and Vidmar (1988). Note that (3) is invariant 
with respect to an intercept term. We shall estimate œ by the median of 
the Wilcoxon residuals, i.e., 


aw = med(y; — x;By) - (5) 


Estimating the intercept in this way avoids unnecessary assumptions such
as symmetric error distributions; see Hettmansperger, McKean and Sheather
(1997). Our aim is to make as few assumptions as possible when concerned
with data exploration.
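
The dispersion (3), the scores (4) and the median intercept (5) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the simulated data and the trial value of $\beta$ are hypothetical, ties in the residuals are broken by position, and no minimization is attempted (the paper uses the Gauss-Newton type algorithm of Kapenga et al.).

```python
import numpy as np

def wilcoxon_scores(n):
    """Scores a(i) = phi(i / (n + 1)) with phi(u) = sqrt(12) * (u - 1/2)."""
    i = np.arange(1, n + 1)
    return np.sqrt(12.0) * (i / (n + 1.0) - 0.5)

def wilcoxon_dispersion(beta, X, y):
    """D_W(beta) = sum_i a(R(e_i)) * e_i with e_i = y_i - x_i' beta."""
    e = y - X @ beta
    ranks = np.argsort(np.argsort(e))  # 0-based ranks; ties broken by position
    return float(np.sum(wilcoxon_scores(len(e))[ranks] * e))

def median_intercept(beta, X, y):
    """Intercept estimate (5): median of the residuals y_i - x_i' beta."""
    return float(np.median(y - X @ beta))

# Hypothetical data, used only to exercise the functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=30)
beta_trial = np.array([1.5, -0.5])

d0 = wilcoxon_dispersion(beta_trial, X, y)
d_shift = wilcoxon_dispersion(beta_trial, X, y + 7.0)  # shift the intercept
```

Because the scores sum to zero and the ranks do not change when a constant is added to every response, `d0` and `d_shift` agree up to rounding, which is the invariance to the intercept noted above.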

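Under regularity conditions, the Wilcoxon estimates have asymptotic
distribution

$$\begin{pmatrix} \hat\alpha_{W} \\ \hat\beta_{W} \end{pmatrix} \sim N\left( \begin{pmatrix} \alpha \\ \beta \end{pmatrix},\ \begin{bmatrix} \tau_S^2/n & 0' \\ 0 & \tau^2 (X'X)^{-1} \end{bmatrix} \right) ,$$    (6)

where $\tau^{-1} = \sqrt{12} \int f^2(t)\,dt$ (Jaeckel, 1972),
$\tau_S = 1/(2f(0))$, and $f$ is the error density. Consistent estimates of
$\tau$ and $\tau_S$ are presented in Koul, Sievers and McKean (1987) and
McKean and Schrader (1983), respectively.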

Our third estimate will be Huber's M-estimate $\hat\beta_M$ which minimizes the
dispersion function

$$D_M(\beta) = \sum_{i=1}^{n} \rho\left( \frac{y_i - \alpha - x_i'\beta}{\sigma_0} \right) ,$$    (7)

where $\sigma_0$ is a scale parameter and $\rho$ is given by

$$\rho(x) = \begin{cases} x^2/2 & \text{if } |x| < h , \\ |x|h - h^2/2 & \text{otherwise.} \end{cases}$$    (8)

The bend, the parameter $h$, must be set. In this paper we will take
$h = 1.345$, the default setting used in Splus; see Becker, Chambers and
Wilks (1988). Under regularity conditions $\hat\beta_M$ has an asymptotic normal
distribution with asymptotic variance $\kappa^2 (\tilde X'\tilde X)^{-1}$, where
$\tilde X = [1_n : X]$ and

$$\kappa^2 = \sigma_0^2\, \frac{E[\psi^2(\epsilon_i/\sigma_0)]}{(E[\psi'(\epsilon_i/\sigma_0)])^2} ,$$

where $\psi(t) = \rho'(t)$; see Huber (1981). The constant of proportionality
$\kappa^2$ can be estimated by the usual moment estimators.
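
As a small illustration of (8), the following sketch codes $\rho$, its derivative $\psi$, and a plug-in version of the ratio defining $\kappa^2$. The function names are hypothetical, and the plug-in estimate (sample means in place of expectations, with no degrees-of-freedom correction) is an assumption, not necessarily the moment estimator used in the paper.

```python
import numpy as np

H = 1.345  # bend h used in the text

def huber_rho(x, h=H):
    """rho(x) = x^2/2 if |x| < h, and |x| h - h^2/2 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < h, 0.5 * x**2, h * np.abs(x) - 0.5 * h**2)

def huber_psi(x, h=H):
    """psi = rho': the identity on (-h, h), clipped at -h and +h outside."""
    return np.clip(np.asarray(x, dtype=float), -h, h)

def kappa2_plugin(resid, sigma0, h=H):
    """Plug-in sigma0^2 * mean(psi^2) / mean(psi')^2 (an assumed estimator)."""
    t = np.asarray(resid, dtype=float) / sigma0
    mean_psi_sq = np.mean(huber_psi(t, h) ** 2)
    mean_psi_prime = np.mean(np.abs(t) < h)  # psi'(t) is 1 inside (-h, h), else 0
    return sigma0**2 * mean_psi_sq / mean_psi_prime**2

# With all scaled residuals inside the bend, the plug-in reduces to mean(r^2).
small = np.array([0.1, -0.2, 0.3])
k2_small = kappa2_plugin(small, sigma0=1.0)
```

The linear branch of `huber_rho` is what bounds the influence of a large residual: `huber_psi` never exceeds `h` in absolute value.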


3 Partial residual plots 


We will be concerned with models of the form

$$y_i = \alpha + \beta_1' x_{1i} + g(x_{2i}) + \epsilon_i ,$$    (9)

where $x_{1i}$ and $x_{2i}$ are $p \times 1$ and $q \times 1$ vectors of
predictors, $x_{1i}'$ is the $i$th row of the $n \times p$ matrix $X_1$, and
$g(x)$ is an unknown function. The goal is to determine the function $g$ as
well as possible using simple graphical techniques. The partial residual plot
described below is an attempt to attain this goal. A recent overview can be
found in the paper by Cook (1993).

The description of the partial residual plot is the same regardless of what
criterion is used to fit a model, so we will describe it generically by dropping
the subscripts LS, W and M for the fitting criteria. Hence, let $\hat\beta$
denote an estimate of the parameter $\beta$ in a model. We will use the
subscripts when distinctions are necessary.

In this article, we will only be looking at cases where the predictor $x_{2i}$
is univariate. Let $x_2 = (x_{12}, \ldots, x_{n2})'$. Since $g$ is unknown we
begin our exploration by fitting a first order model (at the end of this
section we will discuss fitting higher order models). Consider then fitting
the model

$$y_i = \alpha + \beta_1' x_{i1} + \beta_2 x_{i2} + e_i .$$    (10)

Note that, unless $g(x_{i2}) = \beta_2 x_{i2}$, model (10) is a misspecified
model because model (9) is the correct model. We have indicated this in
Model (10) by using $e_i$ instead of $\epsilon_i$ for the random error.

Suppose we have fitted the misspecified model (10). Denote the fit by
$\hat y_i$ and let $\hat e_i = y_i - \hat y_i$ denote the residual. The partial
residuals are defined by

$$\hat e_i^{*} = \hat e_i + \hat\beta_2 x_{i2} ,$$    (11)

that is, the fit of the misspecified part is added back to the residuals. The
partial residual plot is the plot of $\hat e_i^{*}$ versus $x_{i2}$.


3.1 Discussion of partial residual plots


As Cook (1993) noted, since $\hat e_i = y_i - \hat y_i$, we can substitute the
right side of equation (9) for $y_i$ and obtain

$$\hat e_i^{*} = (\alpha - \hat\alpha) + (\beta_1 - \hat\beta_1)' x_{i1} + g(x_{i2}) + \epsilon_i .$$    (12)

Although these estimates are based on a misspecified model, if they are
close to their true values then the partial plot is close to a plot of
$g(x_{i2}) + \epsilon_i$ versus $x_{i2}$.
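
Definitions (11) and (12) can be checked numerically. The sketch below uses a least squares fit only because it needs no iteration; the function name, the simulated predictors and the choice $g(x) = e^{2x}$ are all hypothetical and are not taken from the paper's examples.

```python
import numpy as np

def ls_partial_residuals(x1, x2, y):
    """Fit y = a + b1*x1 + b2*x2 by LS and return (coef, residual + b2*x2)."""
    A = np.column_stack([np.ones_like(x2), x1, x2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, resid + coef[2] * x2  # equation (11)

# Hypothetical data: the true model is (9) with a nonlinear g.
rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(-1.0, 1.0, n)
x2 = rng.uniform(-1.0, 1.0, n)
g = np.exp(2.0 * x2)
y = 1.0 + 0.5 * x1 + g + 0.1 * rng.normal(size=n)

coef, pres = ls_partial_residuals(x1, x2, y)

# Equivalent form: subtract only the fitted intercept and x1 term from y.
pres_direct = y - coef[0] - coef[1] * x1
```

Plotting `pres` against `x2` gives the partial residual plot; with well-behaved data such as these it tracks $g(x_{i2})$ closely, in line with (12). Replacing the LS fit by a Wilcoxon or Huber fit changes only how `coef` is obtained.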

Mansfield and Conerly (1987) considered the expectation properties of
LS based partial residual plots by obtaining algebraic representations of
the partial residuals using the true model distributional properties. Based
on these representations, they showed, among other conclusions, that if the
correct model was fitted then the expected partial residual plot should be
a linear function of $x_2$. They also showed that when $x_2$ and $g$ are both
orthogonal to $X_1$, then we expect the partial residuals to be the unknown
function $g(x)$. On the other hand, if $x_2$ and $X_1$ are highly collinear then
there is little information in the partial residual plot.

Using the first-order approximation theory for robust residuals and fit- 
ted values established in McKean, Sheather and Hettmansperger (1990, 
1993), McKean and Sheather (1997) obtained representations for the par- 
tial residuals when the true model is (9). Based on these representations, 
the conclusions described above of Mansfield and Conerly (1987) hold for 
the robust partial residual plots, also. 

McKean and Sheather (1997) further developed a measure of efficiency
between the robust and LS partial residual plots. If the correct model is fit
then as discussed above the partial residual plot is expected to be the linear
function $\beta_2 x_2$. Thus the plot of interest would be that of
$\hat\beta_{W,2} x_{i2}$ versus $x_{i2}$ overlaid on the partial residual plot.
Hence, it is the precision of the fitted linear term $\hat\beta_{W,2} x_{i2}$
as a predictor of $\beta_2 x_{i2}$ which is of interest in terms of efficiency.
This relative efficiency measure is given by the usual asymptotic relative
efficiency between a robust estimate and the corresponding LS estimate. For
example, if the Wilcoxon based residual plots are used then this asymptotic
relative efficiency is given by

$$e_{W,LS} = \frac{\sigma^2}{\tau^2} ,$$    (13)

where $\sigma^2$ is the variance of the errors and $\tau$ is defined in
expression (6). If the error distribution is normal then $e_{W,LS} = .955$.
However, if the error distribution is heavier tailed than the normal
distribution then this ratio can be quite large; see Hettmansperger (1991).
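
The value .955 for normal errors follows from (6) and (13): for a $N(0, \sigma^2)$ density, $\int f^2(t)\,dt = 1/(2\sigma\sqrt{\pi})$, so $\tau = \sigma\sqrt{\pi/3}$ and $e_{W,LS} = 3/\pi$. A short numerical check is sketched below; the grid, the value of $\sigma$ and the function name are arbitrary choices made for illustration.

```python
import numpy as np

def tau_from_density(f, grid):
    """tau = 1 / (sqrt(12) * integral of f(t)^2 dt), by the trapezoid rule."""
    vals = f(grid) ** 2
    integral = np.sum((vals[1:] + vals[:-1]) * np.diff(grid)) / 2.0
    return 1.0 / (np.sqrt(12.0) * integral)

sigma = 2.0  # any positive value gives the same efficiency

def normal_pdf(t):
    return np.exp(-0.5 * (t / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

grid = np.linspace(-12.0 * sigma, 12.0 * sigma, 200001)
tau = tau_from_density(normal_pdf, grid)
are_w_ls = sigma**2 / tau**2  # equation (13)
```

Here `are_w_ls` agrees with $3/\pi \approx .955$ to numerical accuracy. Passing a heavier-tailed density to `tau_from_density` shows how the ratio changes with the error distribution.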
In comparing the LS and robust representations of the partial residuals,
McKean and Sheather (1997) showed that the random part of the LS
partial residuals has unbounded influence while the corresponding part for
the robust partial residuals has bounded influence. One bad outlier, say
$\epsilon_i$, not only distorts the $i$th residual but other cases, also,
because the representation of the LS partial residual includes the unbounded
term $H\epsilon$, where $H$ is the projection matrix onto the column space of
$[X_1 : x_2]$. On the other hand, this is not true of the robust partial
residuals because in the respective representational expansion of the robust
partial residuals all terms are bounded. The examples and sensitivity study
found in Sections 4 and 5 provide illustrations of the distortion of LS
partial residual plots due to outliers.


3.2 Augmented partial residual plots 


The misspecified part of model (10) is a first-order approximation to $g(x)$.
We can also crawl up the Taylor series expansion of $g(x)$ to fit higher
order polynomials. This was proposed by Mallows (1986) for second-order
representations. In this case, we fit the second-order model,

$$y_i = \alpha + \beta_1' x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i2}^2 + e_i .$$    (14)

Now, the partial residuals are
$\hat e_i^{*} = \hat e_i + \hat\beta_2 x_{i2} + \hat\beta_3 x_{i2}^2$, where
the $\hat e_i$ are the residuals from the fit of model (14). Mallows called the
resulting plot of $\hat e_i^{*}$ versus $x_{i2}$ the augmented partial residual
plot. Another plot of interest here is $\hat e_i^{*}$ versus
$\hat\beta_2 x_{i2} + \hat\beta_3 x_{i2}^2$, because if the quadratic model is
correct this latter plot will appear linear; see the discussion above on the
expected behavior of partial residual plots when the correct model is fit.
Augmented plots are shown in Examples 1 and 3. Certainly higher degree
polynomial approximations to $g(x)$ can be handled in the same way as these
quadratic plots.
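
The augmented partial residuals from model (14) can be sketched in the same way as the first-order ones. Again an LS fit is used purely for convenience, and the function name, coefficients and simulated data are hypothetical.

```python
import numpy as np

def ls_augmented_partial_residuals(x1, x2, y):
    """Fit model (14) by LS; return the quadratic fit and e + b2*x2 + b3*x2^2."""
    A = np.column_stack([np.ones_like(x2), x1, x2, x2**2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    quad_fit = coef[2] * x2 + coef[3] * x2**2
    return quad_fit, resid + quad_fit

# Hypothetical data for which the quadratic model is correct.
rng = np.random.default_rng(3)
n = 150
x1 = rng.uniform(-1.0, 1.0, n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 2.0 * x2 - x2**2 + 0.5 * rng.normal(size=n)

quad_fit, aug = ls_augmented_partial_residuals(x1, x2, y)

# Straight line through the plot of aug versus quad_fit.
slope, intercept = np.polyfit(quad_fit, aug, 1)
```

For an LS fit the residuals are orthogonal to the fitted columns, so the line through this second plot has slope 1 and intercept 0 by construction. What the plot is inspected for is whether the scatter about that line is free of remaining curvature.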


4 Examples 


In this section, we discuss several examples. We have chosen them to
illustrate the exploratory behavior of the partial residual plots based on
robust estimates and to show the sensitivity of the LS based partial residual
plots to outlying observations. The data for all the examples are simulated,
so at all times the correct model is known. There was little difference
between the Wilcoxon based and the Huber based partial residual plots, so
in a few examples only the results for the Wilcoxon based plots are shown.
The Gauss-Newton type algorithm of Kapenga et al. (1988) was used to
compute the Huber and Wilcoxon estimates.

Figure 1: Plots for Example 1: Panel A, Data Overlaid with Wilcoxon,
LMS and HBR fits; Panel B, Partial Residual Plot of the Wilcoxon Fit;
Panel C, Augmented Partial Residual Plot of the Wilcoxon Fit; Panel D,
Augmented Partial Residual Plot of the LMS Fit.


Example 1 Quadratic Model 


The purpose of the first example is to show how the partial and augmented
partial residual plots based on a robust fit behave for a simple
quadratic model. It also shows why caution is necessary when considering
partial residual plots based on high breakdown estimates. The generated
data follow the model

$$y_i = 0 \cdot x_{i1} + 5.5|x_{i2}| - 6x_{i2}^2 + \epsilon_i ,$$    (15)

where the $x_{i1}$'s are iid uniform(-1,1) variates, the $\epsilon_i$'s are
simulated iid N(0,1) variates and the $x_{i2}$'s are simulated contaminated
normal variates with the contamination proportion set at .25 and the ratio of
the variance of the contaminated part to the non-contaminated part set at 16.
This was similar to an example discussed in Chang et al. (1997). Panel A of Figure 1
displays the scatterplot of the data overlaid by the Wilcoxon fit and two 50% 
breakdown fits: least median squares, LMS (Rousseeuw and Leroy, 1987), 
and a 50% high breakdown R estimate proposed by Chang et al. (1997), 
HBR. The LMS was computed using Stromberg’s (1993) algorithm. Note 
that the fit based on the Wilcoxon estimates fits the curvature in the data 
quite well while the 50% breakdown estimates miss the curvature. 
For the misspecified model

$$y_i = \alpha + \beta_1 x_{i1} + \beta_2 |x_{i2}| + e_i ,$$

Panel B of Figure 1 displays the partial residual plot based on the Wilcoxon
fit. The plot clearly shows the need to fit a quadratic model. Panel C shows
the augmented Wilcoxon partial residual plot when a quadratic model was
fit. This is a plot of the partial residual versus the fit of the quadratic
part. If the correct model has been specified then, as noted above, this plot
should show a linear pattern, which it does. Panel D shows the same plot
as Panel C except the fit based on the LMS estimates was used. Instead
of a linear pattern, it shows a quadratic pattern, which is not helpful here
because a quadratic model was fit.


Example 2 Cook’s (1993) Nonlinear Model 


This is an example discussed in Cook (1993). The observations are
generated by

$$y_i = x_{1i} + x_{2i} + \frac{1}{1 + \exp(-x_{3i})} , \quad i = 1, \ldots, 100 ,$$    (16)

where the $x_{3i}$'s are iid uniform(1,26) random variables,
$x_{1i} = x_{3i}^{-1} + z_{1i}$, $x_{2i} = \log x_{3i} + z_{2i}$, $z_{1i}$ has a
$N(0, .1^2)$ distribution, $z_{2i}$ has a $N(0, .25^2)$ distribution, and the
$z_{1i}$'s and the $z_{2i}$'s are independent. A plot of the function
$g(x_{3i}) = 1/(1 + \exp(-x_{3i}))$ versus $x_{3i}$ appears in Panel A of
Figure 2.

Figure 2: Plots for Example 2: Panel A, Plot of g(x3) versus x3; Panel B,
Partial Residual Plot of the LS Fit; Panel C, Partial Residual Plot of the
Huber Fit; Panel D, Partial Residual Plot of the Wilcoxon Fit.


This is the function that the partial residual plots are attempting to
identify. Panels B, C, and D display the partial residual plots based on the
LS-, Huber and Wilcoxon fits, respectively. Note that the variable $x_{3i}$ has
been centered in these plots. The function $g$ is identifiable from both robust
residual plots, but $g$ is not identifiable from the LS-plot. The points which
distorted the LS partial residual plot are the points corresponding to the
low values of $x_{3i}$. These points acted as outliers in Y-space and corrupted
the LS fit, resulting in the poor LS partial residual plot. On the other
hand, the robust fits are much less sensitive to outliers in the Y-space than
the LS fit; hence, these points did not corrupt the robust partial residual
plots. As discussed in Section 3, one way of measuring efficiency in these
plots is by the estimates of the constants of proportionality for the fitting
procedure. For this example, these estimates are: $\hat\sigma = .0162$,
$\hat\kappa = .0048$, and $\hat\tau = .0053$. Hence, the robust estimates are
three times more precise than the LS estimates on this data set.

Cook (1993) expands partial residual plots to the larger class of CERES
plots. This procedure uses a nonparametric estimate of $E(x_1 \mid x_2)$ in
place of the linear function $\beta_2 x_2$ in the regular partial residual plot
in its construction of a partial residual plot. For this data set, as shown in
the article by Cook, the procedure worked well with the LS fit. Similar plots
could be developed based on robust fits, but for this example they are not
needed.


Example 3 Berk and Booth’s Model 


This is an example discussed in Berk and Booth (1995). The first-order
partial residual plots fail on this example for all three fits. We include it
to show the importance of the augmented residual plot.

The values for the $x_{i2}$'s are the 100 values: -.99, -.97, ..., .99. Then
$x_{i1}$ is generated as $x_{i1} = x_{i2}^2 + .052 z_{i1}$, where the $z_{i1}$
are iid standard normal variates. The responses are generated by

$$y_i = x_{i2}^2 + .1 z_{i2} ,$$    (17)

where the $z_{i2}$'s are iid standard normal variates and are independent of
the $z_{i1}$'s. In this example, $g(x_{i2}) = x_{i2}^2$ and Panel A of Figure 3
shows a plot of it versus $x_{i2}$. This is the function that the partial
residual plots are attempting to identify. Panels B, C, and D display the
partial residual plots based on the LS-, Huber and Wilcoxon fits,
respectively. Note that none of them identify the function $g$. This is hardly
surprising. The generating equation for $x_{i1}$ is a strong quadratic in
$x_{i2}$ with little noise, so fitting $x_{i1}$ stole the "clout" of $x_{i2}$.
Also, the quadratic function $g$ is centered over the region of interest. In
its Taylor series expansion about 0, the linear term would not be important;
hence, the inclusion of $x_{i2}$ as linear will not help. If we crawl up the
Taylor series expansion to include a second-order term then both of these
conditions are alleviated and the (augmented) partial residual plot should
identify the quadratic function. This is the case as demonstrated by Panels E
and F of Figure 3, which are the augmented partial residual plots of the LS
and Wilcoxon partial residuals versus the quadratic fit. The linearity of
these plots indicates that the appropriate model has been fit.


202 Joseph W. McKean and Simon J. Sheather 


[Figure 3 graphic. Panel A: Y versus x2. Panels B, C and D: LS, Huber and
Wilcoxon partial residuals versus the linear fit of x_2. Panels E and F: LS
and Wilcoxon partial residuals versus the quadratic fit of x_2.]

Figure 3: Plots for Example 3: Panel A, Plot of g(x2) versus x2; Panel
B, Partial Residual Plot of the LS Fit; Panel C, Partial Residual Plot of
the Huber Fit; Panel D, Partial Residual Plot of the Wilcoxon Fit; Panel
E, Augmented Partial Residual Plot of the LS Fit; Panel F, Augmented
Partial Residual Plot of the Wilcoxon Fit.


5 Sensitivity Study 


The following sensitivity study serves to illustrate the distortion of LS based
partial residual plots caused by outliers in the Y-space. As our baseline model
we consider a cubic polynomial in $x_{i2}$ with normal errors. We chose the
$x_{i1}$ to be uniform(-1,1) variates. The model is
yi = 0-2 +5.52%, — 62% + éi ii =1,...,40, (18) 


where the $\epsilon_i$ are iid N(0,1) and the $x_{i2}$ are generated from a
contaminated normal distribution. The misspecified model is

$$y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + e_i .$$    (19)

In this setting, the partial residual plots should easily show that a cubic
needs to be fit. This is confirmed by the top row of Figure 4, which shows the
partial residual plots based on the LS and the Wilcoxon fits, respectively,
when the misspecified model (19) is fitted.

Next, in a series of four stages we distorted the values of three of the 
responses, as shown in Table 1, from small to large changes of these values. 
We then obtained the partial residual plots based on the LS and Wilcoxon 
fits for each of these stages. 


Table 1: Successive changes to the response variable for Model (18).


    Original   Stage 1   Stage 2   Stage 3   Stage 4

       -.067     0.067     10.56    100.56   1000.56
       62.44    310.44   -620.44   -620.44   -6200.4
       67.24    -335.2    -670.2     670.2   -6700.2


Column A of Figure 4 shows the effect these changes had on the LS
partial residual plot. Note that the limits on the vertical axes were changed
so that the bulk of the cases could be plotted. The distortion is obvious.
From a clear cubic pattern for the baseline model (the top row of the plots)
the LS based partial residual plot becomes more and more distorted as the
successive stages are fitted. The clear cubic pattern has been lost even in
the first stage (the second row of the plots). By the second stage (third
row) the cubic pattern is not identifiable. There is some linear trend in the
third stage (fourth row), but in the final stage (last row) there is just noise.
On the other hand, the cubic pattern is clearly identifiable in the robust
partial residual plots (Column B of Figure 4) for all stages.
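
The mechanism behind this contrast can be seen in a much simpler setting than Model (18). The sketch below is not the paper's study: it uses a straight-line baseline with hypothetical coefficients, moves a single response to 1000, and fits Huber's estimate by iteratively reweighted least squares with a rescaled MAD as the scale, which is a substitute for the Gauss-Newton type algorithm of Kapenga et al. (1988) used in the paper.

```python
import numpy as np

def ls_fit(A, y):
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def huber_fit(A, y, h=1.345, n_iter=50):
    """Huber M-estimate by IRLS; the scale is re-estimated by a rescaled MAD."""
    coef = ls_fit(A, y)
    for _ in range(n_iter):
        r = y - A @ coef
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        # Weight psi(u)/u: 1 inside the bend, h/|u| outside it.
        w = h / np.maximum(np.abs(r) / scale, h)
        sw = np.sqrt(w)
        coef = ls_fit(A * sw[:, None], y * sw)
    return coef

# Hypothetical baseline: y = 1 + 2x + N(0,1) error on an equally spaced design.
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 40)
A = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(size=40)

y_bad = y.copy()
y_bad[-1] = 1000.0  # one response moved far out in Y-space

ls_slope = ls_fit(A, y_bad)[1]
huber_slope = huber_fit(A, y_bad)[1]
```

The LS slope is pulled far from 2 by the single altered response, while the Huber slope stays near it. Since partial residuals are built from the fitted coefficients and residuals, this is the source of the distortion seen in Column A and its absence in Column B.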


6 Conclusion 


LS partial residual plots are an important diagnostic tool for exploratory
fitting. They are often used to identify unknown functions of the predictors.
They are, however, vulnerable to the effect of outliers. One large outlier
can severely distort the LS based partial residual plot, making the
identification of the unknown function of the predictor difficult to impossible.
Furthermore, because the function $g$ can be nonlinear, good data can have
the same effect on the LS partial residual plots as outliers; see Example 2.


Figure 4: Plots for Sensitivity Study: Column A, LS Partial Residual
Plots for Original Data Followed by the LS Plots for Stages 1-4;
Column B, Wilcoxon Partial Residual Plots for Original Data Followed by
the Wilcoxon Plots for Stages 1-4.


In this paper, we have presented partial residual plots based on robust
estimates. As the examples and sensitivity study demonstrated, these partial
residual plots are effective in exploratory fitting. Furthermore, they
are not as vulnerable to the effect of outliers as their LS counterparts. Also,
for highly nonlinear situations such as Example 2, they are able to easily
identify the unknown function. As the sensitivity study shows, even in
the presence of severe outlying observations partial residual plots based on
highly efficient robust estimates are able to retain their focus, making the
identification of the unknown functions possible.
Abstract: For model selection the Bayes factor is not well defined when
using default priors since they are typically improper. To overcome this
problem two methods have recently been proposed. These methods,
intrinsic and fractional, are studied here as methods for producing proper
prior distributions for model selection from the improper conventional
priors for estimation. For nested models, fractional priors are here
defined and a comparison with intrinsic priors introduced by Berger and
Pericchi is carried out. Robustness of the Bayes factor as the prior varies
over the classes of intrinsic and fractional priors is studied. Some
illustrative examples are provided.

1 Introduction


Suppose that two models $M_1$ and $M_2$ are proposed to describe the data
$z = (x_1, x_2, \ldots, x_n)$. Under model $M_i$ the data are distributed as
$f_i(z|\theta_i)$, and the prior distribution for $\theta_i$ is
$\pi_i(\theta_i)$, $i = 1,2$. The Bayesian way to compare the two models
consists in computing the posterior odds

$$\frac{\Pr(M_2|z)}{\Pr(M_1|z)} = B_{21}(z)\,\frac{\Pr(M_2)}{\Pr(M_1)} .$$

Thus, the Bayes factor $B_{21}(z)$ encapsulates all that the data have to say
about such a comparison. This Bayes factor is given by

$$B_{21}(z) = \frac{\int_{\Theta_2} f_2(z|\theta_2)\pi_2(\theta_2)\,d\theta_2}{\int_{\Theta_1} f_1(z|\theta_1)\pi_1(\theta_1)\,d\theta_1} .$$


To subjectively elicit the priors $\pi_i(\theta_i)$, $i = 1,2$, is in most
cases a very difficult task. A way to alleviate this task is to elicit,
instead of a single prior for each model, a class of prior distributions
$\Gamma = \{\pi_1(\theta_1), \pi_2(\theta_2)\}$ that maintains the features of
the priors on which we are confident. Hence, the Bayes factor, as the prior
ranges over $\Gamma$, now takes values in the range

$$\left( \inf_{\Gamma} B_{21}(z),\ \sup_{\Gamma} B_{21}(z) \right) .$$

This range is generally too large and it typically gives
$\inf_{\Gamma} B_{21}(z) = 0$, so that there are priors in $\Gamma$ favouring
$M_1$ and also priors favouring $M_2$. Hence, it does not allow us to decide
which of the models is supported by the data.

Another way to deal with the problem of model selection is to set as
$\pi_i(\theta_i)$ the conventional prior for estimation of $\theta_i$, say
$\pi_i^N(\theta_i)$, which typically is improper, that is, the integral
$\int_{\Theta_i} \pi_i^N(\theta_i)\,d\theta_i$ diverges. This means that no
normalization of $\pi_i^N(\theta_i)$ is possible so that it is defined up to an
arbitrary multiplicative constant. This implies that $B_{21}(z)$ is defined up
to a ratio of unspecified constants.

There are in the literature several ways either to specify the constants 
or to remove them from the analysis, see Akaike (1973), Schwarz (1978), 
Spiegelhalter and Smith (1982), O'Hagan (1995), Berger and Pericchi (1995, 
1996), among others. For a recent review, see Kass and Raftery (1995). 

In this paper we focus on intrinsic and fractional methodologies as methods
for producing proper prior distributions for model comparison. This is
motivated by the fact that while there are methods to produce priors for
estimation that work reasonably well, there is a lack of such methods for
model selection.

Let us briefly summarize the intrinsic and fractional Bayes factors. The
intrinsic Bayes factor (IBF) was proposed by Berger and Pericchi (1995,
1996). This is a partial Bayes factor based on a minimum training sample,
say $x(l)$, which is a minimal subsample of the sample $z$ such that
$0 < \int_{\Theta_i} f_i(x(l)|\theta_i)\pi_i^N(\theta_i)\,d\theta_i < \infty$,
$i = 1,2$. This part of the sample is devoted to converting
$\pi_i^N(\theta_i)$ into $\pi_i^N(\theta_i|x(l))$, which is now proper, and the
rest of the data is devoted to constructing the intrinsic Bayes factor for
model comparison, using the $\pi_i^N(\theta_i|x(l))$ as priors. Thus the
partial Bayes factor is defined as

$$B_{21}(x(-l)\,|\,x(l)) = B_{21}^{N}(z)\,B_{12}^{N}(x(l)) ,$$

where $B_{21}^{N}(z)$ is the Bayes factor for the improper priors
$\pi_i^N(\theta_i)$, $i = 1,2$, and sample $z$. Note that
$B_{21}(x(-l)\,|\,x(l))$ does not depend on the arbitrary constants involved in
the improper priors $\pi_i^N(\theta_i)$, $i = 1,2$. To avoid the dependence of
the partial Bayes factor on the particular $x(l)$, they introduced an average
over the set of all training samples. Thus, the arithmetic intrinsic Bayes
factor is defined as

$$B_{21}^{AI}(z) = B_{21}^{N}(z)\,\frac{1}{L}\sum_{l=1}^{L} B_{12}^{N}(x(l)) ,$$    (1)

where $L$ is the number of training samples contained in $z$.

The fractional method, proposed by O'Hagan, considers the fractional
Bayes factor (FBF) which, as the author motivates, is defined by analogy
with the partial Bayes factor to avoid the arbitrariness of choosing a
particular training sample. The FBF is defined as

$$B_{21}^{F}(z) = B_{21}^{N}(z)\,\frac{\int_{\Theta_1} f_1(z|\theta_1)^{b}\,\pi_1^N(\theta_1)\,d\theta_1}{\int_{\Theta_2} f_2(z|\theta_2)^{b}\,\pi_2^N(\theta_2)\,d\theta_2} ,$$    (2)

where $b$ is a constant that depends on the sample size $n$. Notice that
$B_{21}^{F}(z)$ does not depend on the arbitrary constants involved in the
improper priors.

Both the IBF and the FBF contain $B_{21}^{N}(z)$ as a common factor. The
other factors appearing in the right hand side of (1) and (2) can be
considered as the correction term of $B_{21}^{N}(z)$ to avoid the dependence on
the unspecified constants. These correction terms are different. Furthermore,
while the IBF correction term is completely specified, the FBF correction
term depends on $b$, which has to be assessed. We will go back to this topic
in Subsection 2.4.

A crucial property of the intrinsic method is that it is capable of
generating proper priors. These priors are derived by imposing that the IBF
is asymptotically equivalent to an actual Bayes factor for the so-called
intrinsic priors (see Berger and Pericchi, 1996, for the definition of
intrinsic priors and Moreno, Bertolino and Racugno, 1996, for a
characterization). In some sense this guarantees that the IBF is a true Bayes
factor.

The question is whether something similar can be said of the FBF. In
Subsection 2.2 we produce a functional equation to derive fractional priors
that enjoy the same spirit as the intrinsic priors. The solution of this
equation is also discussed. Computation of the Bayes factor for intrinsic
priors entails a robustness issue since they are not unique; see Subsection
2.1. It will also be shown that the solution to the fractional equation is not
unique so that a similar robustness issue appears here. Robustness is studied
in Subsection 2.3. In Section 3 some illustrative examples are given. Section
4 contains some conclusions.
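
As an illustration of definition (2), consider testing $\theta = 0$ against $\theta \neq 0$ for $N(\theta, 1)$ data, with a point mass at 0 under $M_1$ and a flat improper prior under $M_2$. In this case the correction term has a closed form and (2) reduces to $\sqrt{b}\,\exp\{n(1-b)\bar x^2/2\}$. The sketch below evaluates (2) by numerical integration and can be compared with that expression; the simulated data, the choice $b = 1/n$, the grid and the function names are assumptions made only for this illustration.

```python
import numpy as np

def trapezoid(vals, grid):
    return np.sum((vals[1:] + vals[:-1]) * np.diff(grid)) / 2.0

def log_lik(theta, x):
    """Log-likelihood of iid N(theta, 1) data for each theta supplied."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    sq = np.sum((x[None, :] - theta[:, None]) ** 2, axis=1)
    return -0.5 * len(x) * np.log(2.0 * np.pi) - 0.5 * sq

def fractional_bayes_factor(x, b, grid):
    """FBF (2) for M1: theta = 0 versus M2: theta free with a flat prior."""
    ll2 = log_lik(grid, x)      # model 2 likelihood on the theta grid
    ll1 = log_lik(0.0, x)[0]    # model 1 likelihood at theta = 0
    bn21 = trapezoid(np.exp(ll2), grid) / np.exp(ll1)   # B^N_21(z)
    correction = np.exp(b * ll1) / trapezoid(np.exp(b * ll2), grid)
    return bn21 * correction

# Hypothetical sample and fraction.
rng = np.random.default_rng(4)
n = 20
x = rng.normal(loc=0.5, scale=1.0, size=n)
b = 1.0 / n
grid = np.linspace(-10.0, 10.0, 20001)

fbf = fractional_bayes_factor(x, b, grid)
closed_form = np.sqrt(b) * np.exp(n * (1.0 - b) * np.mean(x) ** 2 / 2.0)
```

The arbitrary constant of the flat prior would multiply both `bn21` and the denominator of `correction`, so it cancels in their product, which is the property noted after (2).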
2 The intrinsic and fractional priors for nested models


Suppose that the two sampling models under comparison
$\{f_1(x|\theta_1), \theta_1 \in \Theta_1\}$ and
$\{f_2(x|\theta_2), \theta_2 \in \Theta_2\}$ are nested. This means that the
following conditions are satisfied:


(i) $\Theta_1 \subset \Theta_2$,
(ii) $f_2(x|\theta_2) = f_1(x|\theta_1)$, for $\theta_2 = \theta_1$.


Let $\pi_i^N(\theta_i)$, $i = 1,2$, be the improper priors chosen.


2.1 The intrinsic priors 


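The intrinsic priors are shown to be (see Moreno, Bertolino and Racugno,
1996) any pair $(\pi_1(\theta_1), \pi_2(\theta_2))$ such that


(a) $\pi_1(\theta_1)$ is any prior in the class

$$\Gamma_1 = \left\{ \pi_1(\theta_1) : \int_{\Theta_1} \pi_1(\theta_1)\,d\theta_1 = 1,\ \int_{\Theta_2} T(\theta_2)\,\pi_1(\psi_1(\theta_2))\,d\theta_2 = 1 \right\} ,$$

where

$$T(\theta_2) = \frac{\pi_2^N(\theta_2)\,E_{\theta_2}^{M_2}[B_{12}^{N}(x(l))]}{\pi_1^N(\psi_1(\theta_2))}$$

and $\psi_1(\theta_2)$ is the limit point of the MLE $\hat\theta_1(z)$ for
parameter $\theta_1$ in $M_1$ when sampling from $M_2$ at point $\theta_2$.


(b) For each $\pi_1(\theta_1) \in \Gamma_1$, $\pi_2(\theta_2)$ is given by

$$\pi_2(\theta_2) = T(\theta_2)\,\pi_1(\psi_1(\theta_2)) .$$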

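For studying robustness of the Bayes factor with respect to intrinsic
priors it is convenient to express $\Gamma_1$ as

$$\Gamma_1 = \left\{ \pi_1(\theta_1) : \int_{\Theta_1} \pi_1(\theta_1)\,d\theta_1 = 1,\ \int_{\Theta_1} V(\theta_1)\,\pi_1(\theta_1)\,d\theta_1 = 1 \right\} ,$$

where

$$V(\theta_1) = T(\theta_1) + \left\{ \int_{\Theta_2(\theta_1)} T(\theta_2)\,d\theta_2 \right\} 1_{\Theta_1^{*}}(\theta_1) ,$$

$\Theta_2(\theta_1) = \{\theta_2 \in (\Theta_2 - \Theta_1) : \psi_1(\theta_2) = \theta_1\}$
is the $\psi_1^{-1}$-coset of $\theta_1$ in $\Theta_2 - \Theta_1$, and
$\Theta_1^{*}$ is the $\psi_1$-image of $\Theta_2 - \Theta_1$.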

2.2 The fractional priors

Let us first precisely state what we mean by fractional priors.

Definition 1 For the sampling models $\{f_i(x|\theta_i), i = 1,2\}$ the proper
priors $\pi_1(\theta_1), \pi_2(\theta_2)$ are called fractional priors if their
Bayes factor and the FBF for $\pi_i^N(\theta_i)$, $i = 1,2$, are asymptotically
equivalent for some sequence $\{b_n\}$.

The following condition is necessary in order to propose the fractional
equation from which the fractional priors are derived.


Assumption (A) The nested models
$\{f_i(x|\theta_i), \pi_i^N(\theta_i), i = 1,2\}$ satisfy Assumption (A) if for
some sequence $\{b_n\}$ the limit in probability $[P_{\theta_2}]$ of the
correction term of the FBF is a degenerate random variable. In other
words, under Assumption (A) there exists a function $F_{12}^{M_2}(\theta_2)$,
which could be a constant, such that

$$F_{12}^{M_2}(\theta_2) = \lim_{n \to \infty} [P_{\theta_2}]\ \frac{\int_{\Theta_1} f_1(z|\theta_1)^{b_n}\,\pi_1^N(\theta_1)\,d\theta_1}{\int_{\Theta_2} f_2(z|\theta_2)^{b_n}\,\pi_2^N(\theta_2)\,d\theta_2} .$$

For simplicity in notation the dependence of $F_{12}^{M_2}(\theta_2)$ on the
sequence $\{b_n\}$ is not made explicit.

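Theorem 1 Under Assumption (A), the fractional priors
$(\pi_1(\theta_1), \pi_2(\theta_2))$ are the solutions to the functional
equation

$$F_{12}^{M_2}(\theta_2) = \frac{\pi_2(\theta_2)}{\pi_2^N(\theta_2)}\,\frac{\pi_1^N(\psi_1(\theta_2))}{\pi_1(\psi_1(\theta_2))} .$$    (3)


Proof: If we expand $\pi_i(\theta_i)/\pi_i^N(\theta_i)$ around the MLE
$\hat\theta_i(z)$, the Bayes factor for the fractional priors
$\pi_1(\theta_1), \pi_2(\theta_2)$ can be approximated as

$$B_{21}(z) = \frac{\int_{\Theta_2} f_2(z|\theta_2)\,\pi_2^N(\theta_2)\,\frac{\pi_2(\theta_2)}{\pi_2^N(\theta_2)}\,d\theta_2}{\int_{\Theta_1} f_1(z|\theta_1)\,\pi_1^N(\theta_1)\,\frac{\pi_1(\theta_1)}{\pi_1^N(\theta_1)}\,d\theta_1} = B_{21}^{N}(z)\,\frac{\pi_2(\hat\theta_2(z))}{\pi_2^N(\hat\theta_2(z))}\,\frac{\pi_1^N(\hat\theta_1(z))}{\pi_1(\hat\theta_1(z))}\,(1 + o(1)) .$$

Equating the limit in probability $[P_{\theta_2}]$ of the fractional Bayes
factor given in (2) with the limit in probability $[P_{\theta_2}]$ of the above
expression, we obtain

$$F_{12}^{M_2}(\theta_2) = \lim_{n \to \infty} [P_{\theta_2}]\ \frac{\pi_2(\hat\theta_2(z))}{\pi_2^N(\hat\theta_2(z))}\,\frac{\pi_1^N(\hat\theta_1(z))}{\pi_1(\hat\theta_1(z))} ,$$

where the left hand side follows from Assumption (A). Notice that we do
not need to take the limit in probability under model $M_1$ since it is nested
in $M_2$. This gives (3) and proves the assertion. □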


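We remark that the fractional priors do not depend on the arbitrary
constants involved in the improper priors. They cancel out in expression (3).

Corollary 1 Fractional priors are any pair
$(\pi_1(\theta_1), \pi_2(\theta_2))$, where $\pi_1(\theta_1)$ is any member of
the class

$$\Gamma_2 = \left\{ \pi_1(\theta_1) : \int_{\Theta_1} \pi_1(\theta_1)\,d\theta_1 = 1,\ \int_{\Theta_2} S(\theta_2)\,\pi_1(\psi_1(\theta_2))\,d\theta_2 = 1 \right\} ,$$

with

$$S(\theta_2) = F_{12}^{M_2}(\theta_2)\,\frac{\pi_2^N(\theta_2)}{\pi_1^N(\psi_1(\theta_2))} ,$$

and for each $\pi_1(\theta_1) \in \Gamma_2$,

$$\pi_2(\theta_2) = S(\theta_2)\,\pi_1(\psi_1(\theta_2)) .$$


Proof: Equation (3) can be written as

$$\pi_2(\theta_2) = S(\theta_2)\,\pi_1(\psi_1(\theta_2)) ,$$

where $\pi_1(\theta_1)$ has to be a probability distribution such that
$\pi_2(\theta_2)$ is also a probability distribution. This proves Corollary 1. □


It is convenient to rewrite class $\Gamma_2$ as

$$\Gamma_2 = \left\{ \pi_1(\theta_1) : \int_{\Theta_1} \pi_1(\theta_1)\,d\theta_1 = \int_{\Theta_1} H(\theta_1)\,\pi_1(\theta_1)\,d\theta_1 = 1 \right\} ,$$

where

$$H(\theta_1) = S(\theta_1) + \left\{ \int_{\Theta_2(\theta_1)} S(\theta_2)\,d\theta_2 \right\} 1_{\Theta_1^{*}}(\theta_1) .$$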


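2.3 Robustness of the Bayes factor for intrinsic and fractional priors


From Subsection 2.1 it follows that the Bayes factor for the intrinsic priors
$(\pi_1(\theta_1), \pi_2(\theta_2))$ is

$$B_{21}(z) = \frac{\int_{\Theta_2} f_2(z|\theta_2)\,T(\theta_2)\,\pi_1(\psi_1(\theta_2))\,d\theta_2}{\int_{\Theta_1} f_1(z|\theta_1)\,\pi_1(\theta_1)\,d\theta_1} ,$$

which is written only in terms of $\pi_1(\theta_1)$.

It is easily shown that $B_{21}(z)$ can be expressed as

$$B_{21}(z) = \frac{\int_{\Theta_1} W(z;\theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int_{\Theta_1} f_1(z|\theta_1)\,\pi_1(\theta_1)\,d\theta_1} ,$$

where $\pi_1(\theta_1) \in \Gamma_1$ and

$$W(z;\theta_1) = f_2(z|\theta_1)\,T(\theta_1) + \left\{ \int_{\Theta_2(\theta_1)} f_2(z|\theta_2)\,T(\theta_2)\,d\theta_2 \right\} 1_{\Theta_1^{*}}(\theta_1) .$$

On the other hand, from Corollary 1 it follows that the Bayes factor for
fractional priors $B_{21}(z)$ can be written only in terms of
$\pi_1(\theta_1) \in \Gamma_2$ as

$$B_{21}(z) = \frac{\int_{\Theta_2} f_2(z|\theta_2)\,S(\theta_2)\,\pi_1(\psi_1(\theta_2))\,d\theta_2}{\int_{\Theta_1} f_1(z|\theta_1)\,\pi_1(\theta_1)\,d\theta_1} .$$

It is straightforward to show that this Bayes factor associated with
fractional priors can be written as

$$B_{21}(z) = \frac{\int_{\Theta_1} G(z;\theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int_{\Theta_1} f_1(z|\theta_1)\,\pi_1(\theta_1)\,d\theta_1} ,$$

where

$$G(z;\theta_1) = f_2(z|\theta_1)\,S(\theta_1) + \left\{ \int_{\Theta_2(\theta_1)} f_2(z|\theta_2)\,S(\theta_2)\,d\theta_2 \right\} 1_{\Theta_1^{*}}(\theta_1) .$$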


Global robustness of the Bayes factors for the intrinsic and the fractional
priors, as $\pi_1$ ranges over $\Gamma_1$ and $\Gamma_2$ respectively, can be
established by computing the ranges

$$\left( \inf_{\pi_1 \in \Gamma_1} B_{21}(z),\ \sup_{\pi_1 \in \Gamma_1} B_{21}(z) \right) , \qquad \left( \inf_{\pi_1 \in \Gamma_2} B_{21}(z),\ \sup_{\pi_1 \in \Gamma_2} B_{21}(z) \right) .$$

This involves a moment problem for which Theorem 2 summarizes the solution.


Theorem 2 The infimum of the Bayes factor for fractional priors as the priors
range over the class $\Gamma_2$, say
$\underline{\lambda} = \inf_{\pi_1 \in \Gamma_2} B_{21}(z)$, is the unique
solution in $\underline{\lambda}$ to the equation

$$\sup_{d \in R}\ \inf_{\theta_1 \in \Theta_1} \left[ G(z;\theta_1) - \underline{\lambda}\,f_1(z|\theta_1) + d\,(1 - H(\theta_1)) \right] = 0 .$$

The supremum is obtained by interchanging sup with inf in the above
expression. A similar statement can be given for the Bayes factor for the
intrinsic priors.


Proof: The proof follows by using the linearization algorithm, see for 
instance Lavine, Wasserman and Wolpert (1993), and the so-called Gen- 
eralized Moment Theory, Kemperman (1987), Salinetti (1994) and Liseo, 
Moreno and Salinetti (1996). 

For data $z$, robustness of a Bayes factor $B_{21}(z)$ as the priors range over a given class $\Gamma$ is strictly achieved if either $\sup_{\pi\in\Gamma} B_{21}(z) < 1$, favouring $M_1$, or $\inf_{\pi\in\Gamma} B_{21}(z) > 1$, in which case $M_2$ is favoured. With obvious adaptations of the suggestion by Jeffreys (see Kass and Raftery, 1995) we would take the following interpretation:

if $1 < \inf_{\pi\in\Gamma} B_{21}(z) < \sqrt{10}$, the evidence against $M_1$ is small,
if $\sqrt{10} < \inf_{\pi\in\Gamma} B_{21}(z) < 10$, the evidence against $M_1$ is substantial,
if $10 < \inf_{\pi\in\Gamma} B_{21}(z) < 100$, the evidence against $M_1$ is strong and
if $100 < \inf_{\pi\in\Gamma} B_{21}(z)$, the evidence against $M_1$ is decisive.
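
The grading above can be written as a small lookup. The following Python sketch is only an illustration: the function name and the label returned when the infimum does not exceed 1 are hypothetical, and the treatment of values falling exactly on a threshold is an assumption, since the scale is stated with strict inequalities.

```python
import math


def evidence_against_m1(inf_b21: float) -> str:
    """Grade the evidence against M1 from the infimum of B21 over the prior class.

    Thresholds 1, sqrt(10), 10 and 100 follow the scale quoted in the text.
    Boundary values are assigned to the higher category (an assumption).
    """
    if inf_b21 <= 1.0:
        # The robustness condition inf B21 > 1 is not met.
        return "not established"
    if inf_b21 < math.sqrt(10.0):
        return "small"
    if inf_b21 < 10.0:
        return "substantial"
    if inf_b21 < 100.0:
        return "strong"
    return "decisive"


if __name__ == "__main__":
    for value in (0.5, 2.0, 5.0, 50.0, 500.0):
        print(value, evidence_against_m1(value))
```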
2.4 The role of the sequence $\{b_n\}$ in the fractional priors


An important difference between the intrinsic priors and the fractional priors is that while the former are derived automatically from the specification of the models $\{f_i(x|\theta_i), \pi_i^N(\theta_i), i = 1, 2\}$, the latter require, in addition, that the sequence $\{b_n\}$ be assessed. In fact, for producing fractional priors we already have a restriction on this sequence, since Assumption (A) has to be satisfied. Nevertheless, this does not guarantee the uniqueness of the sequence, so that an additional convention has to be imposed. Let us illustrate the assertion with the following simple example.


Example 1 Consider the nested models

$$M_1 : f_1(x|\theta_1) = N(x|\theta_1, 1), \quad \pi_1(\theta_1) = 1_{\{0\}}(\theta_1),$$
$$M_2 : f_2(x|\theta_2) = N(x|\theta_2, 1), \quad \pi_2^N(\theta_2) \propto 1_{R}(\theta_2),$$

that is, we are testing that the mean of a normal distribution is 0 versus that it is different from 0. Notice that the prior for the first model is proper and the prior for the second is the conventional uniform improper prior.

The intrinsic priors for these models can be shown to be the unique pair

$$\pi_1(\theta_1) = 1_{\{0\}}(\theta_1), \quad \pi_2(\theta_2) = N(\theta_2|0, 2).$$


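On the other hand the fractional priors are derived as follows. For a given sequence $\{b_n\}$, we have

$$F_{\theta_2}^{\{b_n\}}(\theta_2) = \lim_{n\to\infty}[P_{\theta_2}]\ \frac{f_1(z|0)^{b_n}}{\int_{\Theta_2} f_2(z|\theta_2)^{b_n}\,\pi_2^N(\theta_2)\,d\theta_2} = \lim_{n\to\infty}[P_{\theta_2}]\ \sqrt{\frac{n b_n}{2\pi}}\,\exp\Big\{-b_n\,\frac{n\bar{x}^2}{2}\Big\},$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. The sequences $\{b_n\}$ proposed by O'Hagan (1995) are

$$b_n = \frac{m_0}{n}, \qquad b'_n = \frac{\sqrt{n}}{n}, \qquad b''_n = \frac{\log n}{n}.$$

The sequences $\{b'_n\}$, $\{b''_n\}$ do not satisfy Assumption (A), so that they do not produce fractional priors. Therefore, we are left with sequences of the form $\{b_n = m_0/n,\ m_0 = 1, 2, \ldots, n-1\}$, for which we obtain

$$F_{\theta_2}^{\{b_n\}}(\theta_2) = \sqrt{\frac{m_0}{2\pi}}\,\exp\Big\{-m_0\,\frac{\theta_2^2}{2}\Big\} = N\Big(\theta_2\,\Big|\,0, \frac{1}{m_0}\Big).$$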
Thus, the fractional priors are

$$\pi_1(\theta_1) = 1_{\{0\}}(\theta_1), \quad \pi_2(\theta_2) = N\Big(\theta_2\,\Big|\,0, \frac{1}{m_0}\Big), \quad m_0 = 1, \ldots, n-1.$$

Note that $\pi_2(\theta_2)$ is not unique but depends on $m_0$. The convention we will take is to fix $m_0$ as the minimal training sample size (Berger and Mortera, 1995). In this case $m_0 = 1$. For this assessment the fractional prior for $M_2$ is the density $N(\theta_2|0, 1)$.
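
The algebra in Example 1 can be checked numerically. The sketch below assumes unit-variance normal data, a point prior at 0 for the simple model and a flat prior for the mean in the complex model, as in the example; the function names and the sample values are hypothetical. It compares the closed form $\sqrt{n b/(2\pi)}\exp\{-b\,n\bar{x}^2/2\}$ for the ratio of fractional marginals with a direct trapezoidal integration.

```python
import math


def log_ratio_closed_form(x, b):
    """log of sqrt(n*b/(2*pi)) * exp(-b*n*xbar^2/2) for data x and fraction b."""
    n = len(x)
    xbar = sum(x) / n
    return 0.5 * math.log(n * b / (2.0 * math.pi)) - 0.5 * b * n * xbar ** 2


def log_ratio_numeric(x, b, half_width=12.0, grid=20001):
    """log of f1(z|0)^b / integral of f2(z|theta)^b d(theta), by the trapezoid rule.

    The common factor (2*pi)^(-n*b/2) cancels and is omitted from both terms.
    """
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    log_num = -0.5 * b * sum(xi ** 2 for xi in x)
    sd = 1.0 / math.sqrt(n * b)  # scale of the integrand as a function of theta
    lo, hi = xbar - half_width * sd, xbar + half_width * sd
    h = (hi - lo) / (grid - 1)
    total = 0.0
    for k in range(grid):
        theta = lo + k * h
        weight = 0.5 if k in (0, grid - 1) else 1.0
        # Integrand rescaled by exp(+b*ss/2) to keep the numbers well scaled.
        expo = -0.5 * b * (sum((xi - theta) ** 2 for xi in x) - ss)
        total += weight * math.exp(expo)
    log_den = math.log(total * h) - 0.5 * b * ss
    return log_num - log_den


if __name__ == "__main__":
    x = [0.3, -1.2, 0.8, 2.1, 0.5]  # illustrative sample, not from the paper
    b = 1 / len(x)                  # b_n = m0/n with m0 = 1
    print(log_ratio_closed_form(x, b), log_ratio_numeric(x, b))
```

With $b = m_0/n$ the closed form depends on the data only through $\bar{x}$ and has the shape of a $N(\cdot|0, 1/m_0)$ density evaluated at $\bar{x}$, which is why the limit under $P_{\theta_2}$ is the normal density given in the example.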


3 Examples 


Let us illustrate the behaviour of the Bayes factor for intrinsic and fractional 
priors for two standard problems. The first is one sided testing on the mean 
of a normal distribution and the second is a two sided testing. 


Example 2 One sided testing. Let $X$ be a random variable with distribution $N(x|\theta, 1)$. Suppose that we are interested in testing $H_1 : \theta > 0$ versus $H_2 : \theta < 0$.


A formulation of this testing problem in a nested context would be to compare the two models

$$M_1 : f_1(x|\theta_1) = N(x|\theta_1, 1), \quad \pi_1^N(\theta_1) \propto 1_{[0,\infty)}(\theta_1),$$
$$M_2 : f_2(x|\theta_2) = N(x|\theta_2, 1), \quad \pi_2^N(\theta_2) \propto 1_{(-\infty,\infty)}(\theta_2),$$

where the priors are the standard improper priors for a location parameter. It is easy to see that $\psi_1(\theta_2) = \theta_2\,1_{[0,\infty)}(\theta_2)$.


The class of intrinsic priors is 
r,={ | (£1) (0,;)d0, =1 : EA EA N 
= {7}: —)7 =1—-—,0<k<- 
1 1 A A2 1\Y1 1 Ja 9 


where $k$ is the value of $\pi_1(\theta_1)$ at the discontinuity point $\theta_1 = 0$ (see Moreno, Bertolino and Racugno, 1996) and $\Phi(\theta_1)$ is the standard normal cumulative distribution function at the point $\theta_1$. Thus, for a given sample $(\bar{x}, n)$ and $\pi_1(\theta_1) \in \Gamma_1$, the Bayes factor for $M_2$ against $M_1$ can be shown to be


kN (E, n) + Jo? exp{ -2E e(r (61) 0, 


Bə (z,n) = - 
21(2, n) h exp{ -2E yr, (0; )d6, 


where N(z,n) = fo. exp{ 2252) }0(%)dOo. In Moreno, Bertolino and 
Racugno (1996) it was shown that $\sup_{\pi_1\in\Gamma_1} B_{21}(\bar{x}, n)$ is infinite for any sample point. Since for $\bar{x} > 0$ the value $\inf_{\pi_1\in\Gamma_1} B_{21}(\bar{x}, n)$ is less than 1, we conclude that the Bayes factor is not robust with respect to the intrinsic priors.

A limiting procedure was introduced in Moreno, Bertolino and Racugno (1996) to overcome the lack of robustness. This procedure is based on the fact that when the prior for the simple model $M_1$ is proper, the intrinsic prior for the complex model always exists and is unique.

The idea is to take the restriction of $\pi_1^N(\theta_1)$ to an increasing sequence $\{C_n\}$ of subsets of $\Theta_1$ such that $\int_{C_n} \pi_1^N(\theta_1)\,d\theta_1 < \infty$ and $\lim_{n\to\infty} C_n = \Theta_1$. Then we construct the associated sequence of intrinsic priors for $M_2$ and take the limit of the corresponding sequence of Bayes factors. Under rather general conditions the limit was proved to be independent of the particular sequence $\{C_n\}$ chosen.

Applying this procedure to our example the limiting Bayes factor turns 
out to be 


Bo (Z,n) = ————-. 
Me) = BEV 
On the other hand, the fractional priors for this problem are the following. For sequences of the form $b'_n = \sqrt{n}/n$ and $b''_n = (\log n)/n$,

$$F_{\theta_2}^{\{b_n\}}(\theta_2) = \lim_{n\to\infty}[P_{\theta_2}]\ \frac{\int_{\Theta_1} f_1(z|\theta_1)^{b_n}\,\pi_1^N(\theta_1)\,d\theta_1}{\int_{\Theta_2} f_2(z|\theta_2)^{b_n}\,\pi_2^N(\theta_2)\,d\theta_2} = 1_{[0,\infty)}(\theta_2),$$

so that

$$\pi_2(\theta_2) = \pi_1(\theta_2)\,1_{[0,\infty)}(\theta_2),$$

where $\pi_1(\theta_1)$ is any probability density on $[0,\infty)$. Therefore, the Bayes factor for any data $z$ and any fractional prior turns out to be

$$B_{21}(z) = 1.$$

Therefore, the above sequences $\{b'_n\}$, $\{b''_n\}$ produce fractional priors that give a Bayes factor that is not sensible.

If we choose $b_n = m_0/n$ with $m_0 = 1$, the minimal training sample size, the class of fractional priors turns out to be


ra(82) = F= looo) (82) + #(02) 1 (02) Loo) 02) 


where $\pi_1$ is any prior in the class


EART AEE o<k< 3) 
n ={m: f (01)nı(01)dð1 = Ta l SF Sah 
$k$ being the value of $\pi_1(\theta_1)$ at the discontinuity point $\theta_1 = 0$. The Bayes factor for $\pi_1 \in \Gamma_2$ is given as


Boi (, n) = : 
n(r— fore) n(£Z—O, )? 
k Jo, exp(— "2 )& (62) do + fo? exp- E)E (1) m1 (01) 401 
n(Z—O;)2 i 
I5° expl- 2E ) m1 (61) dO, 


It can be seen that $B_{21}(\bar{x}, n)$ is not robust as the prior ranges over class $\Gamma_2$. To overcome this lack of robustness we can apply the same limiting procedure considered for the Bayes factor for intrinsic priors. A difficulty we find here is that, for a proper prior for the simple model and an improper prior for the complex one, the corresponding fractional prior for the complex model is not necessarily proper. Indeed, if $\pi_1(\theta_1)$ is a proper prior for $M_1$ and $\pi_2^N(\theta_2)$ the improper prior for $M_2$, the corresponding fractional prior for $M_2$ is given as

$$\pi_2(\theta_2) = \pi_2^N(\theta_2)\ \lim_{n\to\infty}[P_{\theta_2}]\ \frac{\int_{\Theta_1} f_1(z|\theta_1)^{b_n}\,\pi_1(\theta_1)\,d\theta_1}{\int_{\Theta_2} f_2(z|\theta_2)^{b_n}\,\pi_2^N(\theta_2)\,d\theta_2},$$


which is not necessarily a probability density. 

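Fortunately, if in this example we take $\pi_1(\theta_1) = \frac{1}{b-a}\,1_{(a,b)}(\theta_1)$, the restriction of $\pi_1^N(\theta_1)$ to the interval $(a,b)$, and $b_n = 1/n$, then the corresponding fractional prior for model $M_2$ is

$$\pi_2(\theta_2) = \frac{1}{b-a}\big(\Phi(b-\theta_2) - \Phi(a-\theta_2)\big),$$

which is a probability density for any values of $a$ and $b$. The Bayes factor for these fractional priors is

$$B_{21}^{(a,b)}(\bar{x}, n) = \frac{\int_{-\infty}^{\infty} \exp\big\{-\frac{n(\bar{x}-\theta_2)^2}{2}\big\}\big(\Phi(b-\theta_2) - \Phi(a-\theta_2)\big)\,d\theta_2}{\int_a^b \exp\big\{-\frac{n(\bar{x}-\theta_1)^2}{2}\big\}\,d\theta_1}.$$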


The limit when $a \to 0$ and $b \to \infty$ results in

$$B_{21}(\bar{x}, n) = \lim B_{21}^{(a,b)}(\bar{x}, n) = \frac{\Phi\big(\bar{x}\sqrt{n/(n+1)}\big)}{\Phi(\bar{x}\sqrt{n})},$$

which is very close to the limiting Bayes factor for intrinsic priors.
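
The limiting fractional Bayes factor of Example 2 can be checked numerically. The sketch below is an illustration under stated assumptions: it takes the Bayes factor for the uniform prior on $(a,b)$ to be the ratio of $\int \exp\{-n(\bar{x}-\theta_2)^2/2\}(\Phi(b-\theta_2)-\Phi(a-\theta_2))\,d\theta_2$ to $\int_a^b \exp\{-n(\bar{x}-\theta_1)^2/2\}\,d\theta_1$, and compares it, for $a = 0$ and a large $b$, with the closed form $\Phi(\bar{x}\sqrt{n/(n+1)})/\Phi(\bar{x}\sqrt{n})$ that follows from the Gaussian identity $E\,\Phi(T) = \Phi(m/\sqrt{1+v})$ for $T \sim N(m, v)$. All function names are hypothetical.

```python
import math


def Phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def trapezoid(f, lo, hi, m=20000):
    """Composite trapezoid rule for f on [lo, hi] with m subintervals."""
    h = (hi - lo) / m
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, m):
        total += f(lo + k * h)
    return total * h


def bayes_factor_ab(xbar, n, a, b):
    """Fractional Bayes factor for the uniform prior on (a, b), b_n = 1/n."""
    def kern(t):
        return math.exp(-0.5 * n * (xbar - t) ** 2)

    w = 10.0 / math.sqrt(n)  # the kernel is negligible beyond 10 standard deviations
    num = trapezoid(lambda t: kern(t) * (Phi(b - t) - Phi(a - t)), xbar - w, xbar + w)
    den = trapezoid(kern, max(a, xbar - w), min(b, xbar + w))
    return num / den


def limiting_bayes_factor(xbar, n):
    """Closed form of the limit as a -> 0 and b -> infinity."""
    return Phi(xbar * math.sqrt(n / (n + 1.0))) / Phi(xbar * math.sqrt(n))


if __name__ == "__main__":
    for xbar in (-0.3, 0.5):
        print(xbar, bayes_factor_ab(xbar, 10, 0.0, 50.0), limiting_bayes_factor(xbar, 10))
```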


We have already said that the intrinsic methodology always generates a class of proper prior distributions (intrinsic priors) for model selection when the models are nested. The fractional methodology, however, does not necessarily generate proper priors. The following example illustrates this assertion.


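Example 3 Two sided testing. Suppose we have to choose between the following two nested models,

$$M_1 : f_1(x|\theta_1) = N(x|0, \sigma_1^2), \quad \pi_1^N(\theta_1) \propto \frac{1}{\sigma_1}\,1_{(0,\infty)}(\sigma_1),$$
$$M_2 : f_2(x|\theta_2) = N(x|\mu, \sigma_2^2), \quad \pi_2^N(\theta_2) \propto \frac{1}{\sigma_2}\,1_{R\times(0,\infty)}(\mu, \sigma_2).$$

The parameter spaces are respectively $\Theta_1 = \{0\} \times R^+$ and $\Theta_2 = R \times R^+$ and the improper priors are given by the Jeffreys rule. It is easy to see that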
h1 (02) = y + 09 

For the data $z = (x_1, x_2, \ldots, x_n)$ and a given sequence $\{b_n\}$, some algebra shows that


Jo, Fi(z|01)°" ay’ (01) dO 


Fy3?(02) = lim [P WA ay a Pe eT 
2( 2) lim | al fo(z|@) o> nd (0 2)d05 
Vnbn s? a 
= lim [P.|—= | = 
Vin \s* +T 
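where $s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. For sequences of the form $\{b'_n\}$, $\{b''_n\}$, Assumption (A) is satisfied, but $F_{\theta_2}^{\{b_n\}}(\theta_2) = 0$ and consequently no proper prior is fractional.

For sequences of the form $\{b_n = m_0/n,\ m_0 = 1, 2, \ldots, n-1\}$,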


mo /2 
o a / MO o2 / 
n= Jin \ oF + p? 


and thus the class of fractional priors is $(\pi_1(\theta_1), \pi_2(\theta_2))$, where


_ vim ope? 
72(02) = Jon E a 2) a= 1(w1(82)), 


and $\pi_1(\theta_1)$ is any prior in the class


I> {n (81) p 


ü ~2 
1, | milo) A ET omdoos ik 
0 


= | 
It is easy to see that this class is empty for any $m_0 > 1$. In particular for $m_0 = 2$, the minimal training sample size for this problem, we have


OO CoO 1 
To = {n1 (01): | 7™1(01)do, = 1, | 1™1(01)do, = Tae 
that is clearly empty. 

Therefore, the sequences {bn} suggested in the literature of fractional 
methodology do not generate proper prior distributions for this two sided 
testing problem. 

However, the class of intrinsic priors is 


Pee | neisi n — m (0ı)dh, so, 


that is not empty. 


4 Conclusions 


In this paper we have considered the intrinsic prior distributions (Berger and Pericchi, 1995 and 1996), and introduced the notion of fractional priors. This allows us to view the intrinsic and fractional methodologies as tools for generating proper prior distributions for model comparison from the conventional improper priors for estimation. The models considered have been assumed to be nested and the main conclusions are:

The intrinsic priors always exist and form a class given by generalized moment constraints. The Bayes factor for intrinsic priors is not necessarily robust, but the limit intrinsic procedure (Moreno, Bertolino and Racugno, 1996) solves this lack of robustness.

The fractional methodology, however, does not always generate proper priors (fractional priors). When fractional priors exist, we have found that the sequence $\{b_n = m_0/n,\ n > 1\}$, with $m_0$ equal to the minimal training sample size, is the appropriate selection among those recommended by O'Hagan (1995). The associated Bayes factor is then very close to the Bayes factor for intrinsic priors. Furthermore, calculations are quite simple.

Extension of this theory to non-nested models is an interesting topic 
that deserves more research. It is a work in progress that will be formalized 
elsewhere. 
Abstract: Sliced inverse regression (SIR) was introduced by Li (1991) and
Duan and Li (1991) as a dimension reduction technique that determines 
the number of linear combinations of the predictor variables needed to 
obtain a parsimonious regression model. It is well known that SIR is 
not robust to the effects of outliers nor can it always detect symmetric 
dependence. In this paper, we briefly outline another technique based on 
inverse regression which potentially overcomes these shortcomings of SIR 
in an important special case. Finally, we compare the effectiveness of the 
new technique with that of SIR on some real data sets. 
1 Introduction


Regression analysis is arguably one of the most widely used statistical tech- 
niques. A regression model expresses the mean of a response variable y as 
a function, f, of an explanatory variable x, a p-dimensional column vector. 
Traditional parametric regression methods assume that the functional re- 
lationship between y and x is known apart from some parameters, which 
must be estimated. When the assumed functional form is correct, a vari- 
ety of methods (including least squares and robust methods) can be used 
to estimate the unknown parameters. However, in many applications any 
parametric model is at best an approximation to the true one and the search 
for an adequate model becomes increasingly difficult as p, the number of
predictors, increases. 

Alternatively, nonparametric regression methods estimate the regression 
function without assuming a particular functional form. Recently, Wand 
and Jones (1995), Fan and Gijbels (1996) and Simonoff (1996) have pro- 
vided comprehensive accounts of this field. Many nonparametric regression 
methods are based on the notion of local smoothing, that is, the estimate of 
f at any point of interest is based on a smoothed version of y in that region. 
Thus, the success of local smoothing depends on the existence of sufficiently 
many data points around each point of interest in the design space to pro- 
vide adequate information about f. As the dimension of x increases, larger 
and larger sample sizes are needed in order to ensure that there are suf- 
ficient data points around each point of interest. This problem has been 
appropriately referred to as the curse of dimensionality (Bellman, 1961). 
Hastie and Tibshirani (1990, pp. 83, 84) provide a simple, yet effective, 
example of this phenomenon. A number of approaches have been proposed 
to cope with the curse of dimensionality. Additive models (see Hastie and 
Tibshirani, 1990, Chapter 4) approximate f as the sum of nonparametric 
univariate functions of each of the p predictors. Alternatively, sliced in- 
verse regression (Li, 1991 and Duan and Li, 1991) is a dimension reduction 
technique that does not rely on a complicated model-fitting process. Sliced 
inverse regression (SIR) determines the number of linear combinations of
the p predictors needed to obtain a parsimonious model for f. 

In the next section, we briefly outline a dimension reduction technique 
based on inverse regression. Finally, in Section 3 we compare the effective- 
ness of this new technique with that of SIR on some real data sets. We 
show that the new technique potentially overcomes two of the shortcomings 
of SIR, namely, a lack of robustness to outliers and a failure to detect some 
forms of symmetric dependence. 


2 Inverse regression methods for dimension 
reduction 


Consider the following general regression model 


$$y = f(\beta_1^T x, \ldots, \beta_k^T x, \varepsilon), \qquad (1)$$


where $f$ is an unknown arbitrary function, $x$ is a $p$-dimensional vector of predictors, and $\varepsilon$ is an $n \times 1$ vector of errors which is assumed to be independent of $x$. The integer $k\ (\le p)$ is the number of linear combinations of the predictors that are needed to summarize the dependence of $y$ on $x$. Li (1991, 1992), Cook and Weisberg (1991) and Schott (1994) provide methods for determining the value of $k$ based on sliced inverse regression.

The simplest nontrivial case occurs when $k = 1$, since then $\beta_1^T x$ contains
all the information from $x$ about $y$. If $k \ge 2$ then commonly the aim is to
reduce k. One possibility, for example, is to seek extra explanatory vari- 
ables. These extra variables could include interaction terms, polynomial 
terms and/or dummy variables. Alternatively, the current set of explana- 
tory variables could be changed using transformations. Further discussion 
of this issue can be found in Cook and Weisberg (1994, Chapter 8). 

Thus arguably, an important special case is the decision as to whether 
k =1 or k > 1. In this paper, we focus on this problem. 

Under the assumptions that model (1) holds with k = 1 and that the 
distribution of x is elliptically symmetric, Duan and Li (1991) obtained the 
following result 


$$E(x_j|y) = E(x_j) + \gamma_j\,\kappa(y), \qquad j = 1, \ldots, p. \qquad (2)$$

This result means that for each predictor $x_j$, the inverse regression function $E(x_j|y)$ equals the mean of $x_j$ plus some unknown function $\kappa(y)$ times the constant $\gamma_j$. A crucial aspect of this result is that $\kappa(y)$ does not depend on $j$. Hence, a graphical procedure to decide whether $k = 1$ or not is to examine the $p$ plots with $x_j$ on the vertical axis and $y$ on the horizontal
axis to see if each plot has the same shape. Such a procedure is advocated 
by Cook and Weisberg (1994, Chapter 8). 
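
Result (2) can be illustrated with simulated data. The following Python sketch is not taken from the paper: the predictors, the direction `beta`, the link function and the slicing scheme are all assumptions chosen for illustration. The predictors are independent standard normal (hence elliptically symmetric) and the response depends on them only through $\beta^T x$, so the centred slice means of each predictor should be proportional to the corresponding component of `beta`, with the same dependence on $y$ for every predictor.

```python
import random


def slice_means(x_cols, y, n_slices=10):
    """Centred mean of every predictor within equal-count slices of y.

    Returns one list per slice (in increasing order of y); each list holds the
    slice mean of every predictor minus that predictor's overall mean.
    """
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    grand = [sum(col) / n for col in x_cols]
    out = []
    for s in range(n_slices):
        idx = order[s * n // n_slices:(s + 1) * n // n_slices]
        out.append([sum(col[i] for i in idx) / len(idx) - g
                    for col, g in zip(x_cols, grand)])
    return out


# Illustrative k = 1 model: y = (beta' x)^3 + small noise (hypothetical choice).
rng = random.Random(0)
n = 20000
beta = (1.0, 2.0, 0.0)
x_cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in beta]
y = []
for i in range(n):
    u = sum(b * col[i] for b, col in zip(beta, x_cols))
    y.append(u ** 3 + rng.gauss(0.0, 0.1))

means = slice_means(x_cols, y)
top = means[-1]  # slice with the largest responses

if __name__ == "__main__":
    for row in means:
        print(["%.3f" % v for v in row])
```

In each slice the second entry should be roughly twice the first and the third should be close to zero, which is the common shape across plots that the graphical procedure looks for.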

Sheather and McKean (1997) have developed two nonparametric meth- 
ods for testing whether k = 1 or not. The two procedures are based on the 
following observation. Suppose that model (1) holds with k = 1 and that 
the distribution of $x$ is elliptically symmetric. Then, for $1 \le i, j \le p$,

$$\log\left|\frac{E(x_j|y) - E(x_j)}{E(x_i|y) - E(x_i)}\right| = \log\left|\frac{\gamma_j}{\gamma_i}\right|, \qquad (3)$$

which is independent of $y$. In practice, the left side of (3) can be replaced by

$$L_{ij} = \log\left|\frac{x_j - \bar{x}_j}{x_i - \bar{x}_i}\right|.$$

Thus, a test of $k = 1$ against the alternative $k > 1$ can be obtained by testing for each combination of $i$ and $j$ whether $L_{ij}$ is independent of $y$. Sheather and McKean (1997) have developed the following two tests.


• Test1: This test is based on dividing up each plot of $L_{ij}$ versus $y$ into four quadrants and counting up the number of points in each quadrant. The quadrants are obtained by splitting both $L_{ij}$ and $y$ into two groups depending on whether they are larger or smaller than their respective medians. If for a given $i$ and $j$, $L_{ij}$ does not depend on $y$ then we expect $n/4$ points to fall in each of the four quadrants. Departures from these expected frequencies are tested using a $\chi^2$ goodness-of-fit test. Sheather and McKean (1997) have found that a significant result on Test1 may indicate that polynomial terms are missing from the regression model.


• Test2: This test of independence is based on a runs statistic. In this case, the statistic used is the number of runs above and below the median value of $L_{ij}$ in each plot of $L_{ij}$ versus $y$. Sheather and McKean (1997) have found that a significant result on Test2 may indicate that interaction terms are missing from the regression model.

For a more detailed description and discussion of Test1 and Test2 see Sheather and McKean (1997).
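
The two statistics can be sketched directly from the verbal descriptions above. The Python code below is an illustration only and is not the implementation of Sheather and McKean (1997): the function names are hypothetical, the reference distribution for the quadrant statistic is left unspecified, and the standardisation of the number of runs uses the usual Wald-Wolfowitz mean and variance, which is an assumption about how the runs statistic would be referred to a normal distribution.

```python
import math


def median(values):
    s = sorted(values)
    m = len(s)
    return s[m // 2] if m % 2 else 0.5 * (s[m // 2 - 1] + s[m // 2])


def log_abs_ratio(xi, xj):
    """L_ij = log |(x_j - mean(x_j)) / (x_i - mean(x_i))| for each case.

    Fails if an observation equals its sample mean exactly (division by zero
    or log of zero); such cases would need separate handling.
    """
    mi = sum(xi) / len(xi)
    mj = sum(xj) / len(xj)
    return [math.log(abs((b - mj) / (a - mi))) for a, b in zip(xi, xj)]


def quadrant_statistic(l_values, y):
    """Chi-square type statistic comparing quadrant counts with n/4 each."""
    n = len(y)
    ml, my = median(l_values), median(y)
    counts = [0, 0, 0, 0]
    for l, v in zip(l_values, y):
        counts[2 * (l > ml) + (v > my)] += 1
    expected = n / 4.0
    return sum((c - expected) ** 2 / expected for c in counts)


def runs_statistic(l_values, y):
    """Number of runs of L above/below its median, ordered by y, and a z-score."""
    ml = median(l_values)
    order = sorted(range(len(y)), key=lambda i: y[i])
    signs = [l_values[i] > ml for i in order if l_values[i] != ml]
    n1 = sum(signs)
    n2 = len(signs) - n1
    runs = 1 + sum(signs[k] != signs[k - 1] for k in range(1, len(signs)))
    mean = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1.0))
    return runs, (runs - mean) / math.sqrt(var)


if __name__ == "__main__":
    y = list(range(100))
    l_dependent = [float(v) for v in y]  # L increases with y: strong dependence
    print(quadrant_statistic(l_dependent, y), runs_statistic(l_dependent, y))
```

When $L_{ij}$ increases with $y$, all points fall in two opposite quadrants and there are only two runs, so both statistics are extreme; under independence the quadrant counts are near $n/4$ and the number of runs is near its expected value.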


3 Examples 


In this section, we discuss several examples involving real data taken from 
Cook and Weisberg (1994). In each example, the aim is to decide whether 
k = 1 or k > 1 in (1), that is, whether one linear combination of the
predictors can adequately summarize the dependence of y on x. The ex- 
amples have been chosen to illustrate that the inverse regression technique 
potentially overcomes two of the shortcomings of SIR, namely, a lack of 
robustness to outliers and a failure to detect some forms of symmetric de- 
pendence. 

The R-code software supplied with Cook and Weisberg (1994) was
used to calculate SIR. In each example the default R-code settings for
SIR were used. Rank based regression estimates were calculated using the 
experimental MINITAB command rregress. Once again all the default 
settings were used. 


Example 1 Ethanol Data 


The data consist of 87 observations obtained from an industrial experiment 
involving a one-cylinder engine using ethanol as a fuel. The response NOx 
is a measure of nitric oxide concentration in exhaust emissions. There are 
two predictors, E and C. E is the equivalence ratio, a measure of the
fuel/air mixture while C is the compression ratio. 


  SIR      Test1     Test2
 0.345   <0.0001    0.0003


Table 1: p-values for testing k = 1 against 
the alternative k > 1 for model (1). 
Table 1 summarises the results from SIR as well as those from Test1 and Test2. As discussed in Cook and Weisberg (1994, pp. 125-127), SIR has failed to detect the symmetric quadratic dependence of NOx on E, which is obvious in plots. On the other hand, Test1 and Test2 find very strong evidence that more than one linear combination of E and C is needed to adequately model NOx. In addition, these tests indicate that higher order polynomial terms and interaction terms may be missing from the model.


The following regression model was fit to the data using least squares,

$$NOx = \gamma_0 + \gamma_1 C + \gamma_2 E + \gamma_3 C^2 + \gamma_4 E^2 + \gamma_5\, C \cdot E + e. \qquad (4)$$


Table 2 summarises the results. The interaction between C and E is highly significant, as is the quadratic term in E. The adjusted $R^2$ value for model (4) is 84.2% while the corresponding figure is 0.0% when the model without the quadratic and interaction terms is fit.


Parameter   Estimate   t-ratio   p-value
γ0           -24.26    -15.702    <0.001
γ1             0.22      1.876     0.864
γ2            56.76     20.592    <0.001
γ3             0.00      0.565     0.574
γ4           -29.79    -21.465    <0.001
γ5            -0.24     -3.868    <0.001


Table 2: Least squares fit of model (4). 


Example 2 Australian Institute of Sport Data 


The data were obtained from 102 male and 100 female athletes at the 
Australian Institute of Sport. Interest centers on modeling LBM (lean 
body mass) as a function of Ht (height in centimetres), Wt (weight in 
kilograms) and RCC (red cell count). Following Cook and Weisberg (1994, 
pp. 122-125) we shall analyse the data for the female and male athletes 
separately. 


Female Athletes 


Table 3 summarises the results from SIR as well as those from Test1 and Test2 for the data on the 100 female athletes. The p-values reported for Test1 and Test2 are the minimum of those obtained from 3 pairwise tests. The row headed '1 point removed' gives the results when the case corresponding to the largest value of LBM is removed from the data, while the row headed '5 points removed' gives the results when the 5 cases marked with a × in Figure 8.5 of Cook and Weisberg (1994, p. 125) are removed from the data.


                     SIR     Test1    Test2
All data            0.026    0.686    0.005
1 point removed     0.229    0.836    0.008
5 points removed    0.617    0.724    0.007


Table 3: p-values for testing k = 1 against 
the alternative k > 1 for model (1). 


As discussed in Cook and Weisberg (1994, pp. 124-125), SIR is not robust to the effects of outliers. In this case removing just one of the 100 data points produces a roughly 10-fold change in the p-value obtained from SIR. On the other hand, Test1 and Test2 change little when a small number of data points are removed from the data. In addition, Test2 finds strong evidence that more than one linear combination of Ht, Wt and RCC is needed to adequately model LBM, and indicates that interaction terms may be missing from the model.

The following regression model was fit to the data using rank based regression,

$$LBM = \gamma_0 + \gamma_1 Ht + \gamma_2 Wt + \gamma_3 RCC + \gamma_4\, Ht \cdot Wt + \gamma_5\, Ht \cdot RCC + \gamma_6\, Wt \cdot RCC + e. \qquad (5)$$

Table 4 summarises the results when model (5) is fit to all 100 data 
points. In this case, the interaction between Ht and Wt is highly significant. 
When the 5 points referred to in Table 3 are removed and model (5) is refit, 
the p-value for the interaction between Ht and Wt increases to 0.056. Thus, 
some but not all of the significance of this interaction term is due to these 
5 points. 


Parameter   Estimate   t-ratio   p-value
γ0          -61.400    -2.567     0.012
γ1            0.719     2.958     0.004
γ2            0.804     1.529     0.130
γ3           -0.007    -0.134     0.894
γ4           -0.006    -2.744     0.007
γ5           -0.056    -1.518     0.132
γ6            0.169     1.754     0.082


Table 4: Rank based regression fit of model (5) to all 100 female athletes. 
Male Athletes


Table 5 summarises the results from SIR as well as those from Test1 and 
Test2 for the data on the 102 male athletes. The p-values reported for 
Test1 and Test2 are the minimum of those obtained from 3 pairwise tests. 
The row headed ‘2 points removed’ gives the results when the cases corre- 
sponding to the two largest values of LBM are removed from the data. 


                     SIR     Test1    Test2
All data            0.017    0.950    0.111
2 points removed    0.173    0.831    0.108


Table 5: p-values for testing k = 1 against 
the alternative k > 1 for model (1). 


These data also illustrate that SIR is not robust to the effects of outliers. In this case removing just two of the 102 data points produces a 10-fold change in the p-value obtained from SIR. On the other hand, Test1 and Test2 change little when a small number of data points are removed from the data. In addition, neither Test1 nor Test2 finds strong evidence that more than one linear combination of Ht, Wt and RCC is needed to adequately model LBM.

The following regression model was fit to the data using rank based regression,

$$LBM = \gamma_0 + \gamma_1 Ht + \gamma_2 Wt + \gamma_3 RCC + \gamma_4\, Ht \cdot Wt + \gamma_5\, Ht \cdot RCC + \gamma_6\, Wt \cdot RCC + e. \qquad (6)$$

Table 6 summarises the results when model (6) is fit to all 102 data points. In this case, none of the interactions are significant. When the 2 points referred to in Table 5 are removed and model (6) is refit, once again none of the interaction terms are significant.


Parameter   Estimate   t-ratio   p-value
γ0          -69.870    -1.072     0.286
γ1            0.590     1.344     0.182
γ2            0.716     1.903     0.060
γ3            6.650     0.544     0.588
γ4           -0.002    -1.070     0.287
γ5           -0.062    -0.749     0.228
γ6            0.058     0.997     0.321


Table 6: Rank based regression fit of model (6) to all 102 male athletes. 


In summary, for the athletes data the nonparametric procedure of Sheather and McKean (1997) seems to correctly identify the case (i.e., the male athletes) when a single linear combination of Ht, Wt and RCC adequately models LBM as well as the case (i.e., the female athletes) when more than one linear combination is needed. On the other hand, SIR indicates for both the female and male athletes that more than one linear combination is needed when all the data are used, while it indicates the opposite when a small number of outliers have been removed from the data.
Abstract: Identification of curvature in regression models is an important
aspect of data analysis. Partial residual plots have played a major role. 
Recently a new class of plots has been developed. They are called CERES 
plots and include partial residual plots as a special case. Implementation 
of these plots necessitates modeling the relationships between certain 
covariates. If these relationships are linear, a partial residual plot is 
obtained. However, if the relationships are nonlinear, the more general 
CERES plot is obtained. Generalized additive models (GAM) are another 
method for identifying and estimating curvature. Again, implementation 
of a GAM requires modeling the relationships between covariates and 
the response. Here, we motivate and describe key features of interactive, 
graphical methods which construct CERES plots and/or GAMs. 


Key words: Partial residual plots, CERES plots, Generalized Additive 
Models, XLISP-STAT, S-PLUS. 


AMS subject classification: 62G07. 


1 Introduction 


Conditional expectation residual plots (CERES plots, see Cook, 1993) and 
Generalized Additive Models (GAMs, see Hastie and Tibshirani, 1990) have 
been developed in the literature as diagnostic and modeling tools for re- 
gression analysis. These methods are designed to detect curvilinear rela- 
tionships between selected covariates and the response variate in regression. 
When used interactively, these methods can also help detect outliers and give information about possible heteroscedasticity.

In this paper, we outline the basic theory and assumptions underlying 
CERES plots and GAMs. Using simulated data, we then illustrate how 
these methods are used and implemented. Both of these methods rely on
the use of scatterplot smoothers. The examples are intended to highlight 
the usefulness of an implementation which (1) shows the data and asso- 
ciated scatterplot smoothers and (2) has an interactive interface so that 
smoothers can be easily changed and results compared. 


2 CERES plots and GAMs - a primer 


Consider the regression model (given $X_1$ and $X_2$) $Y = a_0 + g_1(X_1) + g_2(X_2) + \varepsilon$, where $a_0$ is an unknown constant and $g_1$ and $g_2$ are unknown functions with $E(g_i(X_i)) = 0$ and $E(\varepsilon|X_1, X_2) = 0$. In general, $X_1$ and $X_2$ may be random vectors, but for the purposes of this paper, $X_1$ and $X_2$ are random variables; that is, there are two predictor variables.

The idea behind CERES plots (see Cook, 1993) is that if $g_1$ is the identity function and $E(X_1|X_2)$ is known, then a CERES plot will display the function $g_2$. This display will be with error and a possible vertical shift. In practice, $E(X_1|X_2)$ is unknown, so we estimate it by smoothing the plot of $X_1$ versus $X_2$, and then estimate $g_2$ by smoothing the CERES plot which is obtained by assuming that our estimate of $E(X_1|X_2)$ is correct.
An implementation of CERES plots using the XLISP-STAT software (see 
Tierney, 1990) is given in a paper by Wetzel (1996). 

GAMs have the additional assumption that $\varepsilon$ is independent of $(X_1, X_2)$, and the basic idea is that if we know $g_1$ then $E(Y - a_0 - g_1(X_1)|X_2) = g_2(X_2)$. In practice $g_1$ is unknown, so we use an iterative algorithm to estimate $g_1$, then $g_2$, then $g_1$, etc. An implementation of GAMs is given in
the S-PLUS software. 
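
The alternating scheme just described (estimate $g_1$, then $g_2$, then $g_1$ again) is commonly called backfitting. The Python sketch below is a minimal illustration and not the S-PLUS implementation: the running-mean smoother, the window size `k`, the fixed number of iterations and the simulated data are all assumptions made for the example.

```python
import math
import random


def running_mean(x, r, k=15):
    """Smooth r against x by averaging r over about k nearest neighbours in x-order."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    fitted = [0.0] * len(x)
    half = k // 2
    for pos, i in enumerate(order):
        window = order[max(0, pos - half):min(len(x), pos + half + 1)]
        fitted[i] = sum(r[j] for j in window) / len(window)
    return fitted


def backfit(x1, x2, y, iters=20, k=15):
    """Alternately smooth partial residuals on x1 and x2; returns (a0, g1, g2)."""
    n = len(y)
    a0 = sum(y) / n
    g1 = [0.0] * n
    g2 = [0.0] * n
    for _ in range(iters):
        g1 = running_mean(x1, [y[i] - a0 - g2[i] for i in range(n)], k)
        m1 = sum(g1) / n
        g1 = [v - m1 for v in g1]  # centre so that the fitted g1 averages to zero
        g2 = running_mean(x2, [y[i] - a0 - g1[i] for i in range(n)], k)
        m2 = sum(g2) / n
        g2 = [v - m2 for v in g2]
    return a0, g1, g2


# Illustrative data (not from the paper): y = 1 + x1^2 + sin(x2) + noise.
rng = random.Random(1)
n = 400
x1 = [rng.uniform(-2.0, 2.0) for _ in range(n)]
x2 = [rng.uniform(-3.0, 3.0) for _ in range(n)]
y = [1.0 + a * a + math.sin(b) + rng.gauss(0.0, 0.1) for a, b in zip(x1, x2)]

a0, g1, g2 = backfit(x1, x2, y)
rss = sum((y[i] - a0 - g1[i] - g2[i]) ** 2 for i in range(n))
tss = sum((v - a0) ** 2 for v in y)

if __name__ == "__main__":
    print("share of variation left unexplained:", rss / tss)
```

Replacing `running_mean` by a different smoother, or changing `k`, changes the fitted functions; being able to make such changes and see the result immediately is the kind of adjustment an interactive implementation is meant to support.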

The theory underlying both CERES plots and GAM is powerful; how- 
ever, when used in practice, we need the implementations to be interactive 
enough so that we can be critical users. When using the above methods in 
exploratory data analysis, we need to be able to look 'behind the scenes'
to, in the CERES case, see the smooth which estimates E(X1|X2), and in 
the GAM case, see the iterative process. In order to critically use these 
methods, we must be able to adjust and see the new results quickly. This 
need is demonstrated in the next section. 


3 The need for interactive methods 


In this section, we will look at a few examples which illustrate the need for 
interactive methods when using CERES plots and/or GAMs.
3.1 The influence of the choice of smoother in GAMs

The first example involves randomly generated data with the following distributions: (1) $X_1 \sim N(0, 1)$, (2) $X_2|X_1 \sim N(4 + .25 \times X_1, .01)$, (3) $Y = 6 + X_1^2 + \log(X_2 - 3)$.
Figure 1: Estimated additive functions from GAM.


Two Generalized Additive Models were fitted and the summary plots 
from S-PLUS are given in Figure 1. The top two plots are from the fit given 
by the S-PLUS command gam(y ~ lo(x1) + lo(x2)) and the bottom two
plots from the command gam(y ~ bs(x1) + lo(x2), bf.maxit=20). Here
lo and bs correspond to a loess fit and a B-spline fit, respectively.

The summary plots from S-PLUS show the estimated functions $g_1$ and $g_2$ as well as the points used to estimate these curves. The horizontal axes have the predictors $X_1$ and $X_2$, and the vertical axes have what can be thought of as partial residuals. They are partial residuals, but are weighted in a non-trivial way. As analysts, we should keep in mind that the closer the points are to the curve, the smaller the residual sum of squares of the fit.
Notice that the predicted function of $X_1$ in both cases appears to be quadratic. In the first GAM, we see a loess curve, and in the second a B-spline fit. However, the predicted functions for $X_2$ are very different. In the first case it appears that $Y$ is dependent on $X_2$ quadratically, and in the second, we see the true logarithmic relationship.

This shows that the choice of smoothers in GAM is very important.
Ideally, a dedicated data analyst would see that in the first GAM, the loess
smooth for $X_1$ is under-fitting for both the small and large values of $X_1$.
An interactive interface should let the analyst change the smoother for
$X_1$ directly. An interactive investigation of the 'outlier', where
$X_1 \approx 3$, may also be informative. In this case, deleting the 'outlier'
does not significantly change the predicted model, while changing the
smoother does.

Figure 2: Estimated $E(W_1 \mid W_2)$ and associated CERES plot.


3.2 Looking at intermediate plots for CERES 


As demonstrated in Wetzel (1996), the intermediate step of estimating
$E(X_1 \mid X_2)$ has a great influence on the resulting prediction. Here, we
generate data similar to Wetzel (1996): (1) $W_2 \sim \mathrm{Uniform}[1, 26]$,
(2) $W_1 \mid W_2 \sim N(1/W_2,\ 0.01)$, (3) $Z = W_1 + 1/(1 + \exp(-W_2))$.
In this example $g_2(w_2) = 1/(1 + e^{-w_2})$. A graph of the function $g_2$,
after being shifted both horizontally and vertically, looks exactly like the
lower right plot in Figure 2. The plots in Figure 2 show two choices of
smoother used to estimate $E(W_1 \mid W_2)$ and the associated CERES plot.
These plots were obtained using the XLISP-STAT package and the code
developed in Wetzel (1996). The left plots show the smooth used to estimate
$E(W_1 \mid W_2)$ and the right plots show the corresponding CERES plot.
($W_1$ and $W_2$ are centered in all plots.) The analyst then smooths the
CERES plot and uses the resulting smooth to estimate $g_2$. Again, the
further the points in the CERES plot are from the smooth, the larger the
residual sum of squares for a final fit.
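
The CERES construction just described can be sketched in a few lines. This
is an illustration in Python with NumPy, not the XLISP-STAT code of Wetzel
(1996). The function name `ceres_values`, the seed, the sample size and the
reading of .01 as a variance are assumptions. Following the usual CERES
recipe (Cook, 1993), the response is regressed on $W_1$ and on an estimate
of $E(W_1 \mid W_2)$, and the plotted ordinate is the residual plus the
fitted term in that estimate.

```python
import numpy as np

def ceres_values(z, w1, m_hat):
    """Ordinate of the CERES plot for w2, given m_hat, an estimate of E(w1 | w2)."""
    design = np.column_stack([np.ones_like(z), w1, m_hat])
    beta, *_ = np.linalg.lstsq(design, z, rcond=None)
    resid = z - design @ beta
    # Residual plus the fitted contribution of the conditional-mean term.
    return resid + beta[2] * m_hat

rng = np.random.default_rng(1)
n = 500                                   # hypothetical sample size
w2 = rng.uniform(1.0, 26.0, size=n)
w1 = rng.normal(1.0 / w2, 0.1)            # assumption: .01 is a variance
g2 = 1.0 / (1.0 + np.exp(-w2))
z = w1 + g2

# Two choices for the estimate of E(W1 | W2): the true conditional mean,
# and a deliberately crude straight-line fit.
ceres_exact = ceres_values(z, w1, 1.0 / w2)
ceres_linear = ceres_values(z, w1, np.polyval(np.polyfit(w2, w1, 1), w2))
```

Plotting `ceres_exact` and `ceres_linear` against `w2` gives one way to see
how the intermediate estimate changes the display.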

The plots in Figure 2 show that the choice of smoother has a large effect
on our perception of the amount of noise in the prediction of $g_2(W_2)$.
Again, a dedicated data analyst would be able to interactively adjust the
smoother and see that for a coarse smooth for $E(W_1 \mid W_2)$, the apparent
noise in the prediction of $g_2(W_2)$ is reduced. Although experience has
shown that a coarse smooth for $E(W_1 \mid W_2)$ often results in a more
accurate display of $g_2$, the point here is that the user should easily be
able to experiment with smoothers and parameters. In this case,
experimentation allows us to see that it is not the coarseness of the smooth
that makes the second CERES plot give a more accurate smooth, but instead
it is the fact that the coarser smooth is closer to the truth for small
values of $W_2$. In fact, if we estimate $E(W_1 \mid W_2)$ with a piecewise
linear function using only two lines, the CERES plot looks virtually the
same as the lower right plot of Figure 2.


3.3 Influential Points 


Imagine that in our first example, we had an error in measurement in the
observation where $X_1$ is largest. This point is already a suspected outlier,
but imagine that instead of a response value of 15.04, a response of 19.04
was recorded. We fit the same GAMs used in section 3.1, and the plots in
Figure 3 are obtained.

The error in measurement actually allows the loess smoother for $X_1$ to
begin to capture the true parabolic relationship between $X_1$ and $Y$ for
positive values of $X_1$. Notice that although the estimated functions
for the model fit by the S-PLUS command gam(y ~ lo(x1) + lo(x2)) do
not differ much from those in Figure 1, the observations with values of
$X_2$ between 4.2 and 4.6 are fit much better when we have the error in
measurement.

This shows that single points may be highly influential in the estimated 
fit as well as the perception of fit. 


4 Interactive Methods 


The above examples illustrate that there is a need for interactive methods
which will aid the data analyst in understanding the regression. Such
methods should:

• allow the user to interactively change the smoothers.

• allow the user to investigate the influence of individual points.

• not be too cumbersome for practical use.

Figure 3: Estimated additive functions from GAM.


In Wetzel (1996), such a set of methods was developed for CERES plots.
The estimation of $E(X_1 \mid X_2)$ is displayed and the user may easily change
the smoothing parameters and/or the points used in the calculation. A set of
reasonable defaults was established, but the analyst was presented with
all of the relevant plots. Smoothers can be changed and points deleted, and
after a few mouse clicks all of the plots are updated.

An outline of a similar set of methods for GAM is described below. Since 
the GAM procedure is iterative, we need to be able to monitor the process 
through all of the iterations. This process is described by the following 
algorithm (see Hastie and Tibshirani, 1990). 

• initialize: $g_1 = g_1^0$, $g_2 = g_2^0$.

• cycle: $g_1^j$ = apply a smoother to the plot of $Y - g_2^{j-1}(X_2)$ versus $X_1$,

$g_2^j$ = apply a smoother to the plot of $Y - g_1^j(X_1)$ versus $X_2$,

• continue until $g_1$ and $g_2$ don't change.
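
The cycle above can be sketched as follows. This is a simplified Python
illustration, not the Xlisp-Stat code discussed in this paper: the names
`backfit` and `poly_smoother` are hypothetical, polynomial least squares
stands in for loess or spline smoothers, and the explicit intercept and the
centering of each fitted function are assumptions added so that the two
functions are identifiable.

```python
import numpy as np

def poly_smoother(degree):
    """Return a smoother that fits a polynomial of the given degree."""
    def smooth(x, r):
        return np.polyval(np.polyfit(x, r, degree), x)
    return smooth

def backfit(y, x1, x2, smooth1, smooth2, tol=1e-8, max_iter=200):
    """Two-predictor backfitting; returns intercept, g1, g2 and cycles used."""
    alpha = y.mean()
    g1 = np.zeros(len(y))          # initialize g1 = 0
    g2 = np.zeros(len(y))          # initialize g2 = 0
    for it in range(1, max_iter + 1):
        # Smooth the partial residuals for x1, then those for x2.
        g1_new = smooth1(x1, y - alpha - g2)
        g1_new = g1_new - g1_new.mean()
        g2_new = smooth2(x2, y - alpha - g1_new)
        g2_new = g2_new - g2_new.mean()
        change = max(np.abs(g1_new - g1).max(), np.abs(g2_new - g2).max())
        g1, g2 = g1_new, g2_new
        if change < tol:           # "continue until g1 and g2 don't change"
            break
    return alpha, g1, g2, it
```

An interactive version would expose the two partial-residual plots after
each pass and let the analyst replace `smooth1` or `smooth2`, or drop
points, before the next pass.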


Interactive methods for GAM should allow the user to see each of the
plots which are smoothed. An analyst's perception of the appropriateness
of these smooths will allow that analyst to proceed with another iteration.
If the smooth is seen to be inappropriate, a different smoother is chosen,
and then another iteration is performed. Similarly, if a point is deemed an
outlier, the analyst will take appropriate action and continue with another
iteration. The software should keep track of which smoothers were used at
which iteration as well as which points were used. At some point, we need
to "continue until $g_1$ and $g_2$ don't change." At this point, the
observations which will be used and the smoothers should be fixed. We have
not proven a result, but it seems clear that the first few steps of the
iterative procedure should have little bearing on convergence results.

In order to illustrate these methods, we return to our first example.
Using Xlisp-Stat and an initialization of $g_1^0 = 0$ and $g_2^0 = 0$, we
obtain the plots in Figure 4. The ordering of these plots is left to right,
and top to bottom. The curves shown in the plots are the smooths used to
estimate $g_1^j$ and $g_2^j$. The indicated numerical argument for lowess is
the value used to call the lowess function in Xlisp-Stat. No weighting is
used in this example. For example, the plot in the upper right hand corner
shows $Y - \bar{Y} - g_1^1(X_1)$ versus $X_2$, where the smooth shown in the
upper left plot is used for $g_1^1$.

We notice that the lowess smooth used in the first iteration underestimates
at the extremes, so in the second iteration we use a 2nd degree
polynomial and obtain better estimates for both $g_1$ and $g_2$.

The code for such methods is currently under development; the first plot in
Figure 4 was created with a command from the keyboard, but the other five
plots in Figure 4 were created with a series of mouse clicks. A final mouse
click had the iteration continue until a crude convergence criterion was
met. The final fits do not appear significantly different from the third
row of plots in Figure 4.


5 Discussion 


Finally, it should be clear that there is a connection between CERES plots
and GAMs. Both find nonlinear relationships between the response and
predictors. CERES plots assume that all of the predictors act linearly
except for one. GAMs add additional assumptions to the errors. Berk
and Booth (1995) compare CERES and GAMs to each other as well as
other methods. Also, since at each stage in the iterative process GAMs
use partial residual plots, and partial residual plots assume that the
relationship between predictors is at most linear, strong nonlinear
relationships between the predictors may result in poor GAM performance. An
approach to combining these two ideas is being investigated (see
Croos-Dabrera, 1994). Implementation of these methods should allow
interaction as described above.

Figure 4: Sequential Partial Residual Plots.


References


[1] Berk, K. N. and Booth, D. E. (1995). Seeing a curve in multiple regres- 
sion. Technometrics 37, 385-398. 

[2] Cook, R. D. (1993). Exploring partial residual plots. Technometrics 35,
351-362.

[3] Croos-Dabrera, R. (1994). Graphical analysis of curvature in semipara- 
metric generalized linear models. Ph.D. dissertation, University of Min- 
nesota - School of Statistics. 

[4] Hastie, T. and R. Tibshirani (1990). Generalized Additive Models. New 
York: Chapman and Hall. 

[5] Tierney, L. (1990). LISP-STAT. New York: Wiley. 

[6] Wetzel, N. (1996). Graphical data modeling methods using CERES 
plots. J. Statist. Comput. Simul. 54, 37-44. 


L1-Statistical Procedures and Related Topics
IMS Lecture Notes — Monograph Series (1997) Volume 31 


Nonparametric bounds for the 
probability of future prices based on 
option values 


Gilbert W. Bassett Jr. 


University of Illinois, Chicago, USA 


Abstract: Interest in using option prices to estimate implied probabilities 
of stock values has emerged out of evidence suggesting the lognormal as- 
sumption of the Black Scholes model is no longer accurate. Most of the 
evidence relates to stock index option prices, especially since October 
1987. The Black Scholes model assumes stock prices follow a geometric 
Brownian motion in continuous time - a lognormal distribution in discrete
time. The standard deviation or volatility of the stock price process is 
the only unknown value in the formula so that implied standard devia- 
tions (volatilities) can be deduced from observed option prices. Prior to 
1987, however, the implied volatility tended to curve upwards at far from 
at-the-money strike prices. Because of its shape, the relation came to be 
known as the "smile". The smile implies a fat-tailed underlying distribution,
a long recognized feature of stock prices. Since the 1987 crash, the
smile has deteriorated much farther from what it is supposed to look like 
under lognormality. Not flat and now not even a smile, it skews signifi- 
cantly to the left, indicating large probabilities of price decreases. This 
has led to recent proposals that focus on nonparametric estimates of the 
shape of the underlying distribution. A similar approach is followed here, 
but rather than estimating specific distributions, bounds are derived for 
the set of probability distributions that could have generated observed 
prices. These may be considered as either the first step toward identify- 
ing a single estimate, or as a nonparametric range of estimates for the 
underlying probabilities. 


1 Introduction


Interest in using option prices to estimate implied probabilities of stock
values has emerged out of evidence suggesting the lognormal assumption of
the Black Scholes model is not very accurate. Most of the evidence relates
to stock index option prices, especially since October 1987. The Black
Scholes model assumes stock prices follow a geometric Brownian motion in
continuous time (a lognormal distribution in discrete time). The standard
deviation or volatility of the stock price process is the only unknown value
in the formula so that implied standard deviations (volatilities) can be
deduced from observed option prices.¹ Since European calls with different
strike prices, but the same expiration date, are governed by the same
probability distribution, they will have identical implied volatilities when
the lognormal specification is valid. Prior to 1987, however, the implied
volatility tended to curve upwards at far from at-the-money strike prices.
Because of its shape, the relation came to be known as the "smile". The
smile implies a fat-tailed underlying distribution, a long recognized
feature of stock prices; see, e.g., Mandelbrot (1963) and Fama (1965). Since
the 1987 crash, the smile has deteriorated much farther from what it is
supposed to look like under lognormality. Not flat and now not even a smile,
it skews significantly to the left, indicating large probabilities of price
decreases, which Rubinstein (1994) calls "crashophobia".

The initial response to understanding the smile was to generalize the ge- 
ometric Brownian motion model by making volatility random, while main- 
taining lognormality. Stochastic volatility models generate smiles because 
they make the (unconditioned by volatility) underlying distribution fatter- 
tailed than lognormal. (This corresponds to the well known Monte Carlo 
trick for generating fat-tailed distributions: generate normal variates, but 
with different variances). 

The recent evidence on implied volatilities has led to proposals that 
focus on estimating the entire shape of the underlying distribution; see, e.g., 
Shimko (1993) and Rubinstein (1994). These methods are nonparametric 
and do not presume lognormality. A similar approach will be pursued 
here, but rather than estimating specific distributions, bounds are derived 
for the set of probability distributions that could have generated observed 
prices. These may be considered as either the first step toward identifying a
single estimate, or as a nonparametric range of estimates for the underlying 
probabilities. 


¹ As the only free parameter, volatility stands for everything that affects
option prices, but which is not in the model; see Figlewski (1989).


1.1 Convexity


The estimates are based on a connection between convex functions and
cumulative probability distributions. To any cdf $F$ there is the associated
convex function:

$$g(K) = \int^{K} F(s)\,ds.$$

Convexity of $g$ follows from the fact that its derivative is the
nondecreasing cdf $F$.² The resulting convex function is not arbitrary, as
its derivative must also satisfy the boundary conditions of a cdf:
$F(z) \to 1$ and $0$ as $z \to \pm\infty$. Conversely, to a convex function
$g$ (satisfying the boundary conditions), there is the associated cdf that
is its first derivative.³

This convexity correspondence arises with option valuation because arbi- 
trage-free option prices are necessarily a convex function of strike prices. 
Hence there is always a probability distribution implicit in such option 
prices. Since convexity follows from arbitrage-free valuation alone, there 
exists an implied cdf given any specification of risk preferences and any 
stochastic process for the underlying stock price. 

When investors are risk-neutral the implied probability distribution is 
identical to the cdf of the underlying stock price at expiration. This also 
occurs under assumptions, such as those in the Black Scholes model, where 
call values are determined independently of investor risk preferences. In 
the Black-Scholes model the underlying stock price is assumed to follow a 
geometric Brownian motion, a lognormal distribution in discrete time (or a 
binomial process that is Brownian motion in the limit). When prices follow 
such a process, call values are determined by arbitrage considerations alone 
- risk preferences do not matter - and the implied risk-neutral cdf is the 
same as the one that governs the stock at expiration. 

The existence of implied probabilities however holds generally and does
not require the lognormal specification. There is an implied distribution
when $F$ is not lognormal, and even if investors do not have explicit
probability assessments about future values. In situations where risk
matters, there will be an implied cdf, though it need not agree with the
process that governs prices at expiration.


² This does not require that $g$ be differentiable or that $F$ be continuous.
When $F$ corresponds to a discrete distribution, $g$ is a polyhedral convex
function. $F$ can be recovered from $g$ via the directional derivative, where
the direction is determined by the left/right continuity convention adopted
for cdfs. For properties of convex functions and their derivatives see
Rockafellar (1970).

³ This convexity/probability connection arises in unexpected places. One case
is the generalized Lorenz curve used for determining second degree stochastic
dominance. The generalized Lorenz curve is the $g$ function derived from the
quantile (inverse of $F$) income distribution. A different context where the
convexity is useful is in verifying that a particular linear function of
regression quantiles defines an empirical cdf. It is not at all obvious, for
example, that a combination of regression quantiles defines an empirical cdf
until the combination is recognized as the derivative of a convex function;
see Bassett and Koenker (1982, Theorem 2.1, p.409) and Koenker and Bassett
(1978).

Section 2 briefly describes the connection between convex, arbitrage-free
call values and the implied risk-neutral probability distribution.
Arbitrage-free call values were first described in Merton (1973); see also
Cox and Rubinstein (1985) and Breeden and Litzenberger (1979). By proceeding
from prices to inferred probabilities, we reverse the standard approach in
which prices arise out of causally prior probabilities. Bounds for the
underlying cdf, given a discrete set of option prices, are presented in
Section 3 along with the modifications needed when convexity is invalidated
by non-zero transaction costs, bid-ask spreads, and nonsynchronous prices.


2 Option values and probabilities 


Let $c(K)$ denote the time $t$ value of a European call option with
expiration $T > t$. The underlying asset's current value is $S_t$ and the
unknown value at expiration is $S_T$. Suppose initially that no dividends are
paid between $t$ and $T$, and that there is a continuum of strike prices in
the interval $[0, K_{\max}]$, where $K_{\max}$ is large enough that
$c(K_{\max}) = 0$.


2.1 Risk-neutral call values


Let the risk free rate of return be denoted by $r$. Suppose investors are
risk-neutral; that is, indifferent between a riskless return and a random
return with the same expected value; see Harrison and Pliska (1981). Prices
in equilibrium are then determined by expected values. Let
$F(s) = \Pr[S_T \le s]$ represent a cdf for $S_T$. Since the value $S_t$
invested today in risk free bonds at rate $r$ yields $e^{r(T-t)} S_t$ at time
$T$, the expected value for a risk-neutral investor must grow at the risk
free rate. Hence, the expectation of $F$ is required to satisfy
$E(S_T) = e^{r(T-t)} S_t$, but otherwise $F$ is arbitrary.

The value of $c(K)$ at expiration is the random variable
$\max\{0, S_T - K\}$, whose expected value is

$$E[\max\{0, S_T - K\}] = \int_K^{\infty} (s - K)\,dF(s).$$

Integration by parts yields the convenient expression

$$E[\max\{0, S_T - K\}] = E(S_T) + \int_0^K (F(s) - 1)\,ds.$$

Finally, let $g(K : F)$ denote the discounted present value of the
expectation,

$$g(K : F) = e^{-r(T-t)} E[\max\{0, S_T - K\}] = S_t + e^{-r(T-t)} \int_0^K (F(s) - 1)\,ds. \qquad (1)$$

This provides the basic expression for determining prices from probabilities,
or probabilities from prices. In a risk neutral world with distribution $F$,
call prices are $c(K) = g(K : F)$. Alternatively, given call prices $c(K)$
there exists the implied distribution $F$ such that $g(K : F) = c(K)$.

From expression (1) we see that the first derivative of $g$ with respect
to $K$ recovers the underlying cdf, and the second derivative produces the
probability density; the respective derivatives are $F(K) - 1$ and $f(K)$,
each scaled by the discount factor. Since the functions are related by
integration/differentiation, the call price curve will be smoother than the
cdf, which will be in turn smoother than the density. Finally, expression (1)
shows that $g(0 : F) = S_t$, reflecting the equivalence between the
underlying stock and a zero-strike call option. Since the shares, but not the
options, may receive dividends, the identity has to be modified when
dividends are nonzero.
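
As a numerical check of expression (1), the sketch below evaluates
$g(K : F)$ for a lognormal $F$ by trapezoidal integration and compares it
with the closed-form Black Scholes call value. It is written in Python with
NumPy; the function names and the parameter values ($S_t = 100$, $r = 0.05$,
volatility 0.2, half a year to expiration) are hypothetical choices for
illustration only.

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lognormal_cdf(s, s_t, r, sigma, tau):
    """Risk-neutral lognormal cdf of S_T, with F(s) = 0 for s <= 0."""
    s = np.asarray(s, dtype=float)
    out = np.zeros_like(s)
    pos = s > 0
    mu = math.log(s_t) + (r - 0.5 * sigma**2) * tau
    z = (np.log(s[pos]) - mu) / (sigma * math.sqrt(tau))
    out[pos] = [norm_cdf(v) for v in z]
    return out

def call_from_cdf(k, cdf, s_t, r, tau, n_grid=20001):
    """Expression (1): S_t + exp(-r tau) * integral over [0, K] of (F - 1)."""
    s = np.linspace(0.0, k, n_grid)
    f = cdf(s) - 1.0
    integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s))  # trapezoid rule
    return s_t + math.exp(-r * tau) * integral

def black_scholes_call(k, s_t, r, sigma, tau):
    """Closed-form Black Scholes value of a European call."""
    d1 = (math.log(s_t / k) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return s_t * norm_cdf(d1) - k * math.exp(-r * tau) * norm_cdf(d2)

cdf = lambda s: lognormal_cdf(s, 100.0, 0.05, 0.2, 0.5)
from_cdf = call_from_cdf(105.0, cdf, 100.0, 0.05, 0.5)
closed_form = black_scholes_call(105.0, 100.0, 0.05, 0.2, 0.5)
```

The two values should agree up to the integration error, and setting the
strike to zero returns the stock price, as noted above.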

In view of (1), risk neutral call prices satisfy certain basic properties:
there has to be an $F$ such that $c(K) = g(K : F)$. What does this imply
about the form of $c(K)$? The following are the features of risk-neutral call
values.


1. $c(K)$ is nonnegative with $c(0) = S_t$.
2. $c(K)$ is decreasing with $-e^{-r(T-t)} \le dc/dK \le 0$.
3. $c(K)$ is convex.


The first property follows from $\max\{0, S_T - K\} \ge 0$; the second says
the derivative is nonpositive,

$$dc/dK = dg/dK = e^{-r(T-t)} (F(K) - 1) \le 0;$$

and the third follows from the fact that the first derivative increases with
$K$, or, when there is a density, the second derivative is
$e^{-r(T-t)} f(K) \ge 0$.
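
On a finite grid of strikes, the three properties can be checked with
difference quotients in place of derivatives. The sketch below is in Python
with NumPy; the function name `check_call_properties`, the tolerance and the
example prices are hypothetical, and `discount` stands for $e^{-r(T-t)}$.

```python
import numpy as np

def check_call_properties(strikes, calls, discount=1.0, tol=1e-12):
    """Check properties 1-3 using slopes between adjacent strikes."""
    k = np.asarray(strikes, dtype=float)
    c = np.asarray(calls, dtype=float)
    slopes = np.diff(c) / np.diff(k)
    return {
        "nonnegative": bool(np.all(c >= -tol)),
        "slopes_in_range": bool(
            np.all((slopes >= -discount - tol) & (slopes <= tol))
        ),
        "convex": bool(np.all(np.diff(slopes) >= -tol)),
    }

# Prices implied by S_T = 50 or 150 with probability 1/2 each and r = 0.
good = check_call_properties([0, 50, 100, 150], [100, 50, 25, 0])
# The same prices with the 100-strike call raised to 30 are no longer convex.
bad = check_call_properties([0, 50, 100, 150], [100, 50, 30, 0])
```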


2.2 Arbitrage-free call prices 


Suppose now that risk neutrality is relaxed and investors have arbitrary 
risk preferences and perhaps even know nothing of probability. Suppose, 
however, that all arbitrage opportunities are exploited; that is, call prices 
c(K) are such that there are no riskless profit opportunities from buying 
or selling calls, or investing at the risk free interest rate (assuming zero 
transaction costs). What does this imply about c(K)? 

It is now well known that arbitrage-free call prices are identical to
risk-neutral call prices; call values are arbitrage-free if and only if there is an
F such that c(K) = g(K : F). In a risk neutral world the F representing 
the beliefs of investors is the same as the F implied by call prices, whereas 
in a non risk-neutral world the F implicit in call prices is the equivalent 
martingale measure. This may seem surprising since the arbitrage-free 
requirement says nothing about probabilities. The intuition behind the 
equivalence is similar to Dutch book explanations for coherent beliefs re- 
garding probability assessments. When your beliefs are not consistent with 
the probability axioms you can make book against yourself and win (lose). 
To see why arbitrage-free call values must be nonnegative, decreasing, and 
convex, as well as the risk free arbitrage opportunities that would occur 
if one of the conditions is violated, see Cox and Rubinstein (1985, p.237); 
also, see Cox and Ross (1976) for option valuation with stochastic processes 
other than geometric Brownian motion. 


2.3 Call price curves 


Given expression (1) we can identify an $F$ from a $c(K)$, or a $c(K)$ from an
F. The types of curves that arise in simple special cases are illustrated in 
the following examples. For simplicity r is assumed to be zero. 


2.3.1 Discrete probabilities 


Suppose $S_T$ is a discrete random variable that takes values $s_j$ with
probabilities $p_j$, $j = 1, \ldots, J$. Then the cdf $F(s) = \Pr[S_T \le s]$
is a discontinuous jump function, and call prices are a linear spline,

$$c(K) = c(s_j) - (1 - F(s_j))(K - s_j), \qquad s_j \le K \le s_{j+1}.$$

This situation is illustrated in Figure 1.
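
The linear spline can be checked against a direct evaluation of the
expected payoff. The sketch below is in Python with NumPy and takes $r = 0$
as in the text; the function names and the three-point distribution are
hypothetical.

```python
import numpy as np

def discrete_call(k, values, probs):
    """c(K) = sum over j of p_j * max(s_j - K, 0), with r = 0."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(probs * np.maximum(values - k, 0.0)))

def discrete_cdf(s, values, probs):
    """F(s) = Pr[S_T <= s] for the discrete distribution."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(probs[values <= s]))

values = [80.0, 100.0, 130.0]   # hypothetical support points s_j
probs = [0.3, 0.5, 0.2]         # hypothetical probabilities p_j
k = 110.0                       # a strike between 100 and 130

direct = discrete_call(k, values, probs)
spline = (discrete_call(100.0, values, probs)
          - (1.0 - discrete_cdf(100.0, values, probs)) * (k - 100.0))
```

Both routes give the same call value, and between adjacent support points
the price is linear in the strike.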


2.3.2 Histogram probabilities


Let the probability density for prices at expiration be a histogram: in the
interval $[s_j, s_{j+1}]$, $S_T$ is uniformly distributed. Integrating a
histogram gives a piecewise linear cdf, and integrating again gives a
quadratic spline for call prices. To see this, write $S_T$ as a mixture of
uniform distributions,

$$F(s) = \sum_{j=1}^{J} p_j U_j(s),$$

where

$$U_j(s) = \begin{cases} 0, & s < s_j, \\ \dfrac{s - s_j}{s_{j+1} - s_j}, & s_j \le s \le s_{j+1}, \\ 1, & s > s_{j+1}. \end{cases}$$

Figure 1

Substituting into (1) gives the quadratic spline,

$$c(K) = c(s_j) + \sum_{i=1}^{j-1} p_i (K - s_j) + \frac{1}{2}\, p_j\, \frac{(K - s_j)^2}{s_{j+1} - s_j} - (K - s_j), \qquad s_j \le K \le s_{j+1}.$$

Conversely, if $c(K)$ were a quadratic spline then the implied $F$ would be a
histogram density. This situation is illustrated in Figure 2.


2.3.3 Mixture models 


The representation of the histogram as a mixture of uniform distributions
can be extended directly to cases where the mixing distributions are not
uniform. That is, now let

$$F(s) = \sum_{j=1}^{J} w_j F_j(s).$$

This says that $S_T$ is, with probability $w_j$, the random variable $S_j$
with cdf $F_j$. In the case where all the $F_j$s are lognormal with different
variances, this corresponds to a lognormal stochastic volatility model in
which Hull and White (1987) showed that call prices are the average of the
call prices over the mixture. This extends to risk-neutral valuation with any
mixing distributions, because

$$g\Big(K : \sum_{j=1}^{J} w_j F_j(s)\Big) = \sum_{j=1}^{J} w_j\, g(K : F_j(s)).$$

3 Estimating and bounding implied probability 
distributions 


We first consider the case where call prices are arbitrage-free and hence
convex, but where there are only a discrete number of strikes. Let
$c_i = c(K_i)$, $i = 0, \ldots, n$, denote European call option prices on the
same underlying asset with the same expiration date $T$, but different strike
prices $K_i$, where $K_i < K_{i+1}$ and $K_0 = 0$. The price of the
zero-strike call is set equal to the current price of the stock, $S_t$. (If
there are dividends then $c(0) = e^{-\delta(T-t)} S_t$, where $\delta$ is the
payout rate through the expiration date.) Assuming arbitrage-free valuation,
the remaining call prices must satisfy $c_i = g(K_i : F)$ for some unknown
cdf $F$.




Figure 2: Histogram Probability

The estimation problem is to identify the $F$ that generates the call prices
$(c_0, \ldots, c_n)$. Additional structure can be imposed by introducing
restrictions on the set of allowed cdfs. When $F$ is restricted to be
lognormal with unknown variance the problem reduces to estimating the
volatility smile. Less structure is imposed by Shimko (1993), who essentially
estimates $F$ and its associated density from a smoothed quadratic fit to the
implied volatilities. The method in Rubinstein (1994) is still more general
as it essentially finds the risk-neutral density that is (least squares)
closest to the lognormal density that could have generated call prices.

Consider the set of all cdf’s that are consistent with the observed call 
prices, namely, 


$$\{F \mid g(K_i : F) = c(K_i),\ i = 0, \ldots, n\}.$$


If there were a continuum of strike prices, this set would consist of a
single $F$ and the implied cdf would be exactly

$$F(K) = e^{r(T-t)}\,[dc(K)/dK] + 1.$$

For the discrete strike case, let $\hat{F}(K_i)$ be the analogous difference
quotient,

$$\hat{F}(K_i) = e^{r(T-t)}\, \frac{c(K_{i+1}) - c(K_i)}{K_{i+1} - K_i} + 1, \qquad i = 0, \ldots, n-1,$$

and set $\hat{F}(K_n) = 1$.

These $\hat{F}(K_i)$ values can be used to bound the set of cdfs that could
have generated call prices. Substitute $g(K_i : F)$ for $c(K_i)$ in the
definition of $\hat{F}(K_i)$ and use

$$F(K_i) \le \frac{1}{K_{i+1} - K_i} \int_{K_i}^{K_{i+1}} F(s)\,ds \le F(K_{i+1})$$

to show

$$F(K_i) \le \hat{F}(K_i) \le F(K_{i+1}).$$

This yields upper and lower bounds on $F$ at each strike price; namely,

$$\hat{F}(K_{i-1}) \le F(K_i) \le \hat{F}(K_i).$$


When the K;s are close together this interpolation provides tight bounds 
for the allowed probabilities. When strike prices are far apart the bounds 
will be correspondingly large as it is then necessary to interpolate F over 
a large range of unobserved strike values. 

Since $F$ is nondecreasing, the $\hat{F}(K_i)$ values can be used to bound
the entire $F$ function. Upper and lower cdfs are given by the discrete cdfs
with jumps at the $K_i$ values. Tighter bounds are given by assuming the
underlying $F$ is continuous. In this case the upper and lower cdfs are given
by linearly interpolating between the $\hat{F}(K_i)$ values.

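Define an upper cdf by,

$$F_U(K) = \hat{F}(K_i) + (K - K_i)\, \frac{\hat{F}(K_{i+1}) - \hat{F}(K_i)}{K_{i+1} - K_i}, \qquad K_i \le K \le K_{i+1}, \quad i = 0, \ldots, n-1,$$

and a lower cdf by,

$$F_L(K) = \hat{F}(K_{i-1}) + (K - K_i)\, \frac{\hat{F}(K_{i-1}) - \hat{F}(K_{i-2})}{K_i - K_{i-1}}, \qquad K_{i-1} \le K \le K_i, \quad i = 1, \ldots, n.$$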


These upper and lower cdfs bound the F that could have generated call 
prices, the only restriction being that the underlying cdf is continuous. 
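
The bounds at the strike prices can be computed directly from the
difference quotients. The sketch below is in Python with NumPy; the name
`cdf_bounds` is hypothetical, the lower bound at $K_0 = 0$ is taken to be 0
as an assumption, and the call prices come from an exponential distribution
with mean 100 and $r = 0$, chosen only because its call value,
$100\,e^{-K/100}$, has a closed form.

```python
import math
import numpy as np

def cdf_bounds(strikes, calls, r=0.0, tau=1.0):
    """Difference quotients F_hat(K_i) and pointwise bounds on F(K_i)."""
    k = np.asarray(strikes, dtype=float)
    c = np.asarray(calls, dtype=float)
    quotients = math.exp(r * tau) * np.diff(c) / np.diff(k) + 1.0
    f_hat = np.append(quotients, 1.0)            # F_hat(K_n) = 1
    upper = f_hat                                # F(K_i) <= F_hat(K_i)
    # F(K_i) >= F_hat(K_{i-1}); assumption: the lower bound at K_0 is 0.
    lower = np.concatenate([[0.0], f_hat[:-1]])
    return f_hat, lower, upper

strikes = np.array([0.0, 50.0, 100.0, 150.0, 200.0, 400.0])
calls = 100.0 * np.exp(-strikes / 100.0)         # exponential S_T, r = 0
true_cdf = 1.0 - np.exp(-strikes / 100.0)

f_hat, lower, upper = cdf_bounds(strikes, calls)

# Linear interpolation between the bound values, as in the text.
grid = np.linspace(strikes[0], strikes[-1], 9)
upper_cdf = np.interp(grid, strikes, upper)
lower_cdf = np.interp(grid, strikes, lower)
```

At every strike the true cdf lies between `lower` and `upper`, and the gap
widens where the strikes are far apart.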

The bounds are illustrated in Figure 3. The data is from an example 
in Rubinstein (1994, p.781) in which call prices are generated by the Black 
Scholes model. For comparison the figure also shows a lognormal cdf. Since 
the call prices are generated from the Black Scholes model the lognormal 
cdf falls nicely between the bounding cdfs. 


Figure 3 

Another illustration, also from Rubinstein (1994, p. 784), is shown in
Figure 4. Call prices are for the S&P500 Index at 11AM on January 2, 1990. 
The reference lognormal cdf is now seen to not fall within the bounding 
cdfs. The bounding cdfs are consistent with the estimated density function 
shown in Rubinstein in that the upper tail is much shorter and the left 
tail much longer than lognormal. Market prices imply a cdf with much 
greater chance of downward price movements than would be suggested by 
a lognormal cdf. 

Figure 4: S&P500 Index Options, January 2, 1990; 11AM, 164 Days to Expiration


Remark 1 The distributions that fall between the upper and lower cdfs do 
not all have the same expectation. The allowed risk-neutral cdfs are those 
that fall within the bounds and which have their discounted expectation equal 
to the current stock price. 


3.1 Estimation 


Now let g(K; : F) be a model for actual prices that are observed with 
error. The error term stands for all the non arbitrage reasons for differences 
between the call price and its g value. These reasons include: the bid-ask 
spread, nonsynchronous prices, and positive transaction costs. 

(It will be assumed that the difference between g and observed option 
prices does not depend on K. A more general analysis would permit the 
error variance to depend on trading volume, or, what is practically the 
same, the moneyness of the option, $|S_t - K|$.)

The presence of the error term means actual prices need not be convex.
Hence, linear interpolation between adjacent strike prices need not yield a
convex function, the resulting implied "cdf" based on the difference quotient
need not be nondecreasing, and the implied density and probabilities could be
negative.

Figure 5 shows call prices for the closing S&P500 call on July 13, 1995. 
Prices are almost convex; there is slight concavity in the deep in the money 
calls. (Note that, unlike the above example with 11AM prices, these closing 
prices are likely susceptible to slight departures from convexity on account 
of nonsynchronous prices near the close.) Since prices are nearly convex, 
the convex hull of observed prices is used for the values of g at the given 
strike prices. 
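
One way to compute such a convex hull is sketched below in Python with
NumPy. It is an illustration under stated assumptions, not the procedure
used for the figures: the function name `lower_convex_hull` and the example
prices are hypothetical, and the strikes are assumed to be strictly
increasing. The function returns the greatest convex function lying on or
below the observed prices, evaluated at the strikes.

```python
import numpy as np

def lower_convex_hull(strikes, calls):
    """Values at the strikes of the lower convex hull of (strike, price)."""
    k = np.asarray(strikes, dtype=float)
    c = np.asarray(calls, dtype=float)
    hull = []                       # indices of the hull vertices
    for i in range(len(k)):
        # Drop the last vertex while it lies on or above the chord
        # joining the vertex before it to the new point.
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = (k[b] - k[a]) * (c[i] - c[a]) - (c[b] - c[a]) * (k[i] - k[a])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(k, k[hull], c[hull])

# The 100-strike price of 30 sits above the chord from (50, 50) to (150, 0).
hulled = lower_convex_hull([0, 50, 100, 150], [100, 50, 30, 0])
```

Prices that are already convex are returned unchanged, so the step only
alters the strikes where convexity fails.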

Figure 5: S&P500 September Call, July 13, 1995

The interest rate and dividend discount factors were estimated using
put-call parity as in Shimko (1993). Bounds based on the convex hull of 
call prices are shown in Figure 6. The figure depicts bounding cdfs that are 
jumpier than shown in the previous figures, perhaps due to nonsynchronous 
closing prices. Similar to the other figures, however, there is a fat left-hand 
tail and a large difference from lognormality. 


Figure 6: S&P September Call, July 13, 1995, Close


1 Introduction


The modelling of the stochastic process followed by the price of an asset 
is an important part of financial analysis. An understanding of this pro- 
cess is the first step to the pricing of derivative securities and general risk 
management. It is therefore important to identify a model for asset price 
processes which is consistent with their major empirical properties, such 
as heavy tailed return distributions, volatility clustering, long memory and 
persistence after volatility shocks. Previous approaches have typically
concentrated on specific models, e.g. ARCH, and have not so far succeeded in
jointly modelling all of the major empirical properties. To attack this
problem systematically we first study the marginal distributions of returns
and volatility for market price indexes. Only after that do we feel a
substantial effort can be made to identify further evolutionary properties of
volatility and asset price processes.

In this paper we compare various distributions for modelling the
leptokurtic marginal distribution of asset returns. The distributions
considered are: the normal (or Gaussian); the stable; the normal-lognormal
mixture of Clark (1973); and the generalised hyperbolic, which includes the
Student t, the normal-inverse Gaussian mixture of Barndorff-Nielsen (1995),
the hyperbolic of Eberlein and Keller (1995) and Küchler et al. (1995) and
the variance-gamma (normal-gamma mixture) of Madan and Seneta (1990).
These distributions are all mixtures of the normal distribution and differ
only by their mixing volatility distributions.

It is crucial for any kind of serious risk analysis and management to 
emphasise the importance of correctly modelling the tail probabilities of 
returns. This is the reason why we will focus our comparative analysis on 
the identification of typical tail properties of index returns. Tests are per- 
formed on price indexes to directly determine a best marginal distribution 
for returns from the above mentioned alternatives. This distribution indi- 
rectly determines a best marginal distribution for the volatility. The best 
distribution for the index returns, with respect to the likelihood ratio test, 
turns out to be the Student t distribution. This distribution implies an 
inverted gamma distribution for the squared volatility. The Anderson and 
Darling (1952) test is also used to identify specifically the tail properties 
and additionally supports this result. 


2 The class of generalised lognormal asset price 
models 


Consider the class of generalised lognormal models for the asset price
process $S = \{S(t),\ t \ge 0\}$ given by the Itô process with stochastic
differential equation

\[ dS(t) = \mu(t)\, S(t)\, dt + \sigma(t)\, S(t)\, dW(t), \tag{1} \]

for $0 \le t_0 \le t < \infty$. The stochastic process
$W = \{W(t),\ t \ge 0\}$ represents the noise process, which is assumed to
be a standard Wiener process on a filtered probability space
$(\Omega, \mathcal{F}, \underline{\mathcal{F}} = (\mathcal{F}_t)_{t \ge 0}, P)$
fulfilling the usual conditions. We also have the drift process
$\mu = \{\mu(t),\ t \ge 0\}$ and the non-negative volatility process
$\sigma = \{\sigma(t),\ t \ge 0\}$. These two processes may be constant,
deterministic time-dependent or stochastic. In general we assume that they
are $\underline{\mathcal{F}}$-adapted, right-continuous with left-hand
limits and that a unique, strong solution of (1) exists. The explicit
solution of (1) for the asset price $S$ has the form

\[ S(t) = S(t_0) \exp\left\{ \int_{t_0}^{t} \left( \mu(u) - \tfrac{1}{2}\sigma(u)^2 \right) du + \int_{t_0}^{t} \sigma(u)\, dW(u) \right\}, \tag{2} \]

for $0 \le t_0 \le t < \infty$. Throughout this paper we will stay within
the class of generalised lognormal asset price models.

Let us define the returns of the asset price process $S$. We denote by
$r_\Delta(t)$ the time $t$ (continuously compounded) return of the asset
price $S$ for the interval $[t, t + \Delta)$. It is defined as

\[ r_\Delta(t) = \log S(t + \Delta) - \log S(t). \tag{3} \]


When the drift $\mu$ and the volatility $\sigma$ are constant for all
times $t$, the asset price model, defined by equation (1) or (2), is called
the classical lognormal model. The Gaussian assumption of the theoretical
return distribution of this classical model seriously restricts its shape,
especially the tail thickness. Asset returns are usually observed to have
leptokurtic empirical distributions. That is, they have heavier tails and
a more pronounced peak around the mode than a normal distribution.

We denote the log asset price process by
$L = \{L(t) = \log S(t),\ t \ge 0\}$. The quadratic variation process
$\langle L \rangle = \{\langle L \rangle(t),\ t \ge 0\}$ is given by (see
e.g. Jacod and Shiryaev, 1987, §4e)

\[ \langle L \rangle(t) = \int_0^t \sigma(u)^2\, du, \tag{4} \]

for a generalised lognormal model.

We define the empirical quadratic variation process
$\langle L \rangle_\Delta = \{\langle L \rangle_\Delta(t),\ t \ge 0\}$,
based on time steps of length $\Delta$, of the log asset price process $L$
to be

\[ \langle L \rangle_\Delta(t) = \sum_{j=0}^{n-1} r_\Delta(j\Delta)^2, \tag{5} \]

where $n\Delta \le t < (n+1)\Delta$, for some $n \in N$, and
$r_\Delta(\cdot)$ are the returns defined in (3). Note that the empirical
quadratic variation process $\langle L \rangle_\Delta$ is an estimate for
the true underlying quadratic variation process $\langle L \rangle$. It
converges (under rather general assumptions) P-a.s. to $\langle L \rangle$
as $\Delta \to 0$.
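
As an illustration of (3) and (5), the following minimal Python sketch computes the running sum of squared log returns from a price series. The function names and the short price series are hypothetical and serve only to show the mechanics; they are not the index data analysed in this paper.

```python
import math

def log_returns(prices):
    """Continuously compounded returns log S(t + delta) - log S(t), as in (3)."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

def empirical_quadratic_variation(prices):
    """Running sum of squared returns, one value per observation time, as in (5)."""
    qv = [0.0]
    for r in log_returns(prices):
        qv.append(qv[-1] + r * r)
    return qv

# Hypothetical price series, for illustration only.
prices = [100.0, 101.0, 99.5, 102.0, 101.0]
qv = empirical_quadratic_variation(prices)
```

By construction the resulting sequence is non-decreasing, and periods of high volatility appear as steep segments.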

The daily empirical quadratic variation processes of the log indexes are
shown in Figure 1. This figure indicates that the quadratic variation
processes $\langle L \rangle$ are stochastic non-decreasing processes.
Consequently it follows from (4) that the volatility $\sigma$ is a
stochastic process. One can say it is the stochastic nature of volatility
which makes the distribution of returns leptokurtic. Below it will be shown
in general that the theoretical return distribution in a generalised
lognormal model is leptokurtic when the volatility $\sigma$ is stochastic.

Figure 1: Daily empirical quadratic variation processes for the log market
price indexes.


3 Mixing distributions 


Consider the time discretisation

\[ 0 = t_0 < t_1 < t_2 < \cdots \tag{6} \]

where $t_i = i\Delta$ for all $i \in N$ and $\Delta > 0$. Define $i_t$ as
the largest integer $i$ such that $t_i$ is less than or equal to $t$, i.e.
$i_t = \max\{i \in N : t_i \le t\}$. We obtain the important class of
discrete time generalised lognormal asset price models by keeping the drift
$\mu$ and the volatility $\sigma$ piecewise constant over the
discretisation intervals. A discrete time generalised lognormal model's
log asset price process is then given by
$L_\Delta = \{L_\Delta(t) = L_\Delta(t_{i_t}),\ t \ge 0\}$, where

\[ L_\Delta(t_{i+1}) = L_\Delta(t_i) + \left( \mu(t_i) - \tfrac{1}{2}\sigma(t_i)^2 \right) \Delta + \sigma(t_i)\, \Delta W(t_i), \tag{7} \]

for $i \in N$ and where $\Delta W(t_i)$, $i = 0, 1, 2, \ldots$, are
independent and identically distributed normal random variables with zero
mean and variance $\Delta$. The stochastic difference equation can be
interpreted as that of an Euler approximation for a certain log asset price
process $L$ as defined earlier. Note that the discrete time log asset price
process $L_\Delta$ then converges to its continuous time limit $L$ as the
time step size $\Delta$ tends to zero (see e.g. Kloeden and Platen, 1992).
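
The recursion (7) is straightforward to simulate. The sketch below is only an illustration under assumptions that are not part of the paper: the drift is constant, and the volatility is drawn independently at each step from two hypothetical levels (0.1 and 0.4). It compares the sample kurtosis of the simulated returns with the constant volatility case.

```python
import math
import random

def simulate_log_price(n_steps, delta, mu, sigmas, seed=0):
    """Euler recursion (7); sigmas[i] is the volatility held fixed on step i."""
    rng = random.Random(seed)
    path = [0.0]
    for i in range(n_steps):
        sigma = sigmas[i]
        dw = rng.gauss(0.0, math.sqrt(delta))  # N(0, delta) increment
        path.append(path[-1] + (mu - 0.5 * sigma ** 2) * delta + sigma * dw)
    return path

def sample_kurtosis(xs):
    """Fourth central sample moment divided by the squared sample variance."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2

def returns_of(path):
    return [b - a for a, b in zip(path, path[1:])]

n, delta = 20000, 1.0 / 250.0
vol_rng = random.Random(1)
# Assumption for illustration: i.i.d. two-level volatility, not a fitted model.
stochastic = [vol_rng.choice((0.1, 0.4)) for _ in range(n)]
constant = [0.25] * n

k_stoch = sample_kurtosis(returns_of(simulate_log_price(n, delta, 0.05, stochastic)))
k_const = sample_kurtosis(returns_of(simulate_log_price(n, delta, 0.05, constant)))
```

With constant volatility the sample kurtosis stays close to 3, whereas the randomly switching volatility produces a clearly larger value.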

Consider now the random variable $X$. We denote its distribution function
(d.f.) by $F_X(\cdot)$, its characteristic function (c.f.) by
$\phi_X(\cdot)$, and its probability density function (p.d.f.) by
$f_X(\cdot)$ if it exists. We also denote the $n$th moment of $X$ by
$m_{X,n} = E(X^n)$, where $E(\cdot)$ denotes the expectation operator. It
is well known that the mean $\mu_X = E(X)$, variance
$\sigma_X^2 = E((X - \mu_X)^2)$, skewness
$\beta_X = E((X - \mu_X)^3)/\sigma_X^3$ and kurtosis
$\kappa_X = E((X - \mu_X)^4)/\sigma_X^4$ are respectively measures of the
location, variability, degree of asymmetry and tail thickness/peakedness of
the distribution of $X$. Note that the moments $m_{X,n}$, $n \in N$, can be
calculated by using the c.f. $\phi_X(\cdot)$ with the well known formula

\[ m_{X,n} = (-i)^n\, \phi_X^{(n)}(0), \tag{8} \]

where $\phi_X^{(n)}(\cdot)$ denotes the $n$th derivative of the c.f. The
mean, variance, skewness and kurtosis can be calculated from the moments.

The volatility $\sigma$ is an unobservable quantity. As such, the
distribution of volatility cannot be quantified directly. However, we can
obtain it indirectly via the distribution of returns. Equations (3) and
(7) give the returns for the discrete time generalised lognormal model as

\[ r_\Delta(t_i) = \left( \mu(t_i) - \tfrac{1}{2}\sigma(t_i)^2 \right) \Delta + \sigma(t_i)\, \Delta W(t_i), \tag{9} \]

for $i \in N$. The return $r_\Delta(t_i)$, conditioned on the random
variable $\sigma(t_i)^2$, is a normally distributed random variable with
mean $(\mu(t_i) - \tfrac{1}{2}\sigma(t_i)^2)\Delta$ and variance
$\sigma(t_i)^2 \Delta$. The drift coefficient $\mu(t_i)$ may depend on the
random variable $\sigma(t_i)^2$, so we write the conditional mean as
$\xi(t_i, \sigma(t_i)^2)\Delta$, with
$\xi(t_i, \sigma(t_i)^2) = \mu(t_i) - \tfrac{1}{2}\sigma(t_i)^2$, to denote
this possible dependence.

Let us assume some properties for the conditional mean
$\xi(t_i, \sigma(t_i)^2)\Delta$ and the conditional variance
$\sigma(t_i)^2\Delta$ of the return $r_\Delta(t_i)$ given in (9):

First we assume that the volatility $\sigma$ is a stationary process (see
e.g. Feller, 1966, §III.7). Then $\sigma(t)$ is identically distributed
according to the invariant distribution of $\sigma$ for any given time $t$.
Stationarity is a reasonable assumption because, if we look at the
quadratic variation processes in Figure 1, we notice that the processes are
similar over the entire time period. They could be described as having
linear trends interspersed with strongly increasing (very volatile) periods
which slowly revert back towards another linear trend with the same slope.
This indicates that the volatility is a stationary process.
We also make the simplifying assumption that the mean has the structure

\[ \xi(t_i, \sigma(t_i)^2)\, \Delta = \left( \eta + \varrho\, \sigma(t_i)^2 \right) \Delta, \tag{10} \]

for some constants $\eta$ and $\varrho$.


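These two assumptions imply the following relations for the unconditional
distribution of returns $r_\Delta(\cdot)$. It follows that their d.f. is

\[ F_{r_\Delta}(x) = \int_0^\infty \Phi\!\left( \frac{x - (\eta + \varrho u)\Delta}{\sqrt{u \Delta}} \right) dF_{\sigma^2}(u) \qquad (x \in R), \tag{11} \]

where $\Phi(\cdot)$ is the standard normal d.f. The corresponding p.d.f. is

\[ f_{r_\Delta}(x) = \int_0^\infty \frac{1}{\sqrt{2\pi u \Delta}} \exp\left\{ -\frac{\left( x - (\eta + \varrho u)\Delta \right)^2}{2 u \Delta} \right\} dF_{\sigma^2}(u) \qquad (x \in R), \tag{12} \]

if it exists, and the corresponding c.f. is

\[ \phi_{r_\Delta}(\theta) = \int_0^\infty \exp\left\{ i (\eta + \varrho u) \Delta \theta - \tfrac{1}{2} u \Delta \theta^2 \right\} dF_{\sigma^2}(u) \qquad (\theta \in R). \tag{13} \]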

The unconditional distribution is called a mixture distribution and the
distribution of $\sigma^2$ is called the mixing distribution (see e.g.
Feller, 1966, §II.5).

Feller (1966), §XVII.3(i), gives the general representation of the c.f.
$\phi_{\sigma^2}(\cdot)$ for the non-negative random variable $\sigma^2$.
Then the moments $m_{\sigma^2,n}$, $n \in N$, can be calculated by using
(8) and this representation of the c.f. $\phi_{\sigma^2}(\cdot)$. The mean
$\mu_{\sigma^2}$, variance $\sigma^2_{\sigma^2}$, skewness
$\beta_{\sigma^2}$ and kurtosis $\kappa_{\sigma^2}$ can consequently be
calculated. It is easily shown from these values that
$\mu_{\sigma^2}, \sigma^2_{\sigma^2}, \beta_{\sigma^2} > 0$ and
$\kappa_{\sigma^2} > 3$, if $\sigma^2$ is not deterministic. That is, the
random variable $\sigma^2$ is positively skewed and leptokurtic.

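Let us now compute the mean $\mu_{r_\Delta}$, variance
$\sigma^2_{r_\Delta}$, skewness $\beta_{r_\Delta}$ and kurtosis
$\kappa_{r_\Delta}$ for the return $r_\Delta(\cdot)$ as important measures
required to understand the distributional properties of asset prices.
Formulae for these measures are obtained via equations (8) and (13), giving

\[
\begin{aligned}
\mu_{r_\Delta} &= \eta \Delta + \varrho\, \mu_{\sigma^2} \Delta, \\
\sigma^2_{r_\Delta} &= \mu_{\sigma^2} \Delta + \varrho^2 \sigma^2_{\sigma^2} \Delta^2, \\
\beta_{r_\Delta} &= \frac{3 \varrho\, \sigma^2_{\sigma^2} \Delta^2 + \varrho^3 \beta_{\sigma^2} \sigma^3_{\sigma^2} \Delta^3}{\sigma^3_{r_\Delta}}, \\
\kappa_{r_\Delta} &= \frac{3 \left( \mu^2_{\sigma^2} + \sigma^2_{\sigma^2} \right) \Delta^2 + 6 \varrho^2 \left( \mu_{\sigma^2} \sigma^2_{\sigma^2} + \beta_{\sigma^2} \sigma^3_{\sigma^2} \right) \Delta^3 + \varrho^4 \kappa_{\sigma^2} \sigma^4_{\sigma^2} \Delta^4}{\sigma^4_{r_\Delta}}.
\end{aligned}
\tag{14}
\]

It now easily follows that if $\sigma^2$ is stochastic, i.e.
$\sigma^2_{\sigma^2} > 0$, then $\kappa_{r_\Delta} > 3$. We also get that
$\mathrm{sign}(\beta_{r_\Delta}) = \mathrm{sign}(\varrho)$, by using (14)
and the above fact that $\sigma^2$ is positively skewed and leptokurtic,
i.e. $\beta_{\sigma^2} > 0$, $\kappa_{\sigma^2} > 3$. The returns
$r_\Delta(\cdot)$ are therefore leptokurtic and skewed in the direction of
the sign of $\varrho$. The above relations show that the stochastic nature
of volatility implies a leptokurtic distribution for the returns.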
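
The relations in (14) can be evaluated directly. The following Python sketch takes the two constants of (10) (called eta and rho in the code), the step size, and the mean, variance, skewness and kurtosis of the squared volatility, and returns the four measures for the returns. The function name and the numerical inputs are our own illustrative choices; the example uses an exponentially distributed squared volatility with unit mean, whose variance, skewness and kurtosis are 1, 2 and 9.

```python
import math

def return_moments(eta, rho, delta, m, v, b, k):
    """Mean, variance, skewness and kurtosis of returns from (14).

    m, v, b, k are the mean, variance, skewness and kurtosis of sigma^2.
    """
    s = math.sqrt(v)  # standard deviation of sigma^2
    mean = (eta + rho * m) * delta
    var = m * delta + rho ** 2 * v * delta ** 2
    third = 3 * rho * v * delta ** 2 + rho ** 3 * b * s ** 3 * delta ** 3
    fourth = (3 * (m ** 2 + v) * delta ** 2
              + 6 * rho ** 2 * (m * v + b * s ** 3) * delta ** 3
              + rho ** 4 * k * v ** 2 * delta ** 4)
    return mean, var, third / var ** 1.5, fourth / var ** 2

# Symmetric case (rho = 0) with exponential mixing: kurtosis 3 * (1 + 1) = 6.
symmetric = return_moments(0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 9.0)
# Deterministic volatility (zero variance of sigma^2): kurtosis 3.
gaussian = return_moments(0.1, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0)
# Negative rho gives negatively skewed returns.
skewed = return_moments(0.0, -0.5, 1.0, 1.0, 1.0, 2.0, 9.0)
```

In the symmetric case the kurtosis reduces to three times one plus the squared coefficient of variation of the squared volatility, which makes the link between random volatility and heavy tails explicit.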

In our analysis of price indexes in Section 5 the empirical return
distributions are fairly symmetrical. Consequently, to simplify the
analysis, we assume that the distribution of asset returns is symmetric.
This means we explicitly assume that $\varrho = 0$ from this point onward.


4 Marginal distributions of asset returns 


As already mentioned we focus our analysis on the marginal distribution
of asset returns. The marginal distributions for the returns in the models
which we examine in this section are mixtures of the normal distribution.
They differ by the mixing distribution of $\sigma^2$. As discussed above,
this implies a leptokurtic distribution for the returns if $\sigma^2$ is
random. Each model can be characterised either by its mixing distribution
for $\sigma^2$ or, equivalently, by its mixture distribution for
$r_\Delta$, since the two are related by equations (11), (12) and (13).

Below we briefly characterise the different models by giving only the
probability density function $f_{r_\Delta}(\cdot)$ or the characteristic
function $\phi_{r_\Delta}(\cdot)$ for the marginal distribution of the
returns $r_\Delta(\cdot)$. A more detailed treatment can be found in Hurst,
Platen and Rachev (1996).

It must be emphasised that, unlike the following models, we do not intend
to model the asset price process as having independent increments. The
i.i.d. modelling assumption is inconsistent with the well-known properties
of volatility clustering, long memory and persistence for asset prices. We
will interpret the result more correctly as an identification of the
marginal distribution for asset returns.


4.1 The Mandelbrot and Fama logstable model 


Mandelbrot (1963, 1967) and Fama (1963, 1965) proposed returns to be
distributed with an $\alpha$-stable distribution. This occurs when the
stationary distribution of $\sigma^2$ is a maximally skewed
$\alpha/2$-stable distribution with $\alpha \in (0, 2)$ (see e.g.
Mandelbrot and Taylor, 1967). The c.f. of the returns $r_\Delta(\cdot)$ is


pra (0) = exp {in A0- 1A} (0ER). (15) 


The parameter $\alpha$ is called the index of stability and is a shape
parameter for the distribution: the smaller $\alpha$, the thicker the
tails. This model also implies infinite variance and infinite kurtosis for
the return distribution.
4.2 The Clark model


Clark (1973) proposed the change in the asset price process to be
distributed with a normal-lognormal mixture distribution. Prices have to be
positive and represent growth processes; therefore we are interested in the
change in the log asset price process, i.e. returns. We modify Clark's
model here and propose the returns to be distributed with a
normal-lognormal mixture distribution. In this modified version of the
Clark model the stationary distribution of $\sigma^2$ is a lognormal
distribution. The p.d.f. of the returns $r_\Delta(\cdot)$ is

\[ f_{r_\Delta}(x) = \frac{1}{2\pi \varphi \sqrt{\Delta}} \int_0^\infty u^{-3/2} \exp\left\{ -\frac{1}{2} \left( \frac{(x - \eta\Delta)^2}{u \Delta} + \frac{\left( \log u - \log c^2 + \tfrac{1}{2}\varphi^2 \right)^2}{\varphi^2} \right) \right\} du \qquad (x \in R). \tag{16} \]

The parameter $\varphi$ is a shape parameter for the distribution. The
return distribution for this model has kurtosis
$\kappa_{r_\Delta} = 3 \exp(\varphi^2)$.


4.3 The log symmetric generalised hyperbolic model 


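Various authors (e.g. Praetz, 1972; Blattberg and Gonedes, 1974; Madan
and Seneta, 1990; Barndorff-Nielsen, 1995; Eberlein and Keller, 1995;
Küchler et al., 1995) have proposed returns to be distributed within the
class of the generalised hyperbolic distributions. We consider the more
restrictive symmetric generalised hyperbolic distributions for the return
distribution. These distributions result when the stationary distribution
of $\sigma^2$ is a generalised inverse Gaussian distribution. The p.d.f. of
the returns $r_\Delta(\cdot)$ is

\[ f_{r_\Delta}(x) = \frac{1}{\delta \sqrt{\Delta}\, K_\lambda(\bar\alpha)} \sqrt{\frac{\bar\alpha}{2\pi}} \left( 1 + \frac{(x - \eta\Delta)^2}{\delta^2 \Delta} \right)^{\frac{1}{2}\left(\lambda - \frac{1}{2}\right)} K_{\lambda - \frac{1}{2}}\!\left( \bar\alpha \sqrt{1 + \frac{(x - \eta\Delta)^2}{\delta^2 \Delta}} \right) \qquad (x \in R), \tag{17} \]

where $K_\lambda(\cdot)$ is the modified Bessel function of the third kind
with index $\lambda$, $\lambda \in R$ and $\alpha, \delta \ge 0$. In
addition $\alpha \ne 0$ if $\lambda \ge 0$ and $\delta \ne 0$ if
$\lambda \le 0$. The parameters $\lambda$ and $\bar\alpha = \alpha\delta$
are invariant shape parameters. The return distribution for this model has
kurtosis
$\kappa_{r_\Delta} = 3 K_\lambda(\bar\alpha) K_{\lambda+2}(\bar\alpha) / K_{\lambda+1}(\bar\alpha)^2$.

In the following we briefly consider special parameterisations of this
model which have previously been considered by other authors.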
4.4 The log Student t model


Praetz (1972) and Blattberg and Gonedes (1974) proposed that the returns
should be distributed with a Student t distribution with degrees of freedom
$\nu > 0$. This occurs when the shape parameters
$\lambda = -\tfrac{1}{2}\nu < 0$ and $\bar\alpha = 0$, i.e. $\alpha = 0$,
and the parameter $\delta = c\sqrt{\nu}$. The p.d.f. of the returns
$r_\Delta(\cdot)$ is

\[ f_{r_\Delta}(x) = \frac{\Gamma\!\left( \tfrac{1}{2}\nu + \tfrac{1}{2} \right)}{c \sqrt{\pi \nu \Delta}\; \Gamma\!\left( \tfrac{1}{2}\nu \right)} \left( 1 + \frac{(x - \eta\Delta)^2}{c^2 \nu \Delta} \right)^{-\left( \frac{1}{2}\nu + \frac{1}{2} \right)} \qquad (x \in R). \tag{18} \]

The degrees of freedom $\nu$ is the shape parameter for the distribution.
The return distribution for this model has kurtosis
$\kappa_{r_\Delta} = 3(\nu - 2)/(\nu - 4)$, for $\nu > 4$, and is infinite
otherwise.


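4.5 The Barndorff-Nielsen log normal-inverse Gaussian model


Barndorff-Nielsen (1995) proposed returns to be distributed with a
normal-inverse Gaussian mixture distribution. This occurs when the shape
parameter $\lambda = -\tfrac{1}{2}$. The p.d.f. of the returns
$r_\Delta(\cdot)$ is

\[ f_{r_\Delta}(x) = \frac{\bar\alpha \exp\{\bar\alpha\}}{\pi \delta \sqrt{\Delta}} \left( 1 + \frac{(x - \eta\Delta)^2}{\delta^2 \Delta} \right)^{-\frac{1}{2}} K_1\!\left( \bar\alpha \sqrt{1 + \frac{(x - \eta\Delta)^2}{\delta^2 \Delta}} \right) \qquad (x \in R). \tag{19} \]

The parameter $\bar\alpha$ is the shape parameter for the distribution. The
return distribution for this model has kurtosis
$\kappa_{r_\Delta} = 3(1 + 1/\bar\alpha)$.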


4.6 The log hyperbolic model 


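Eberlein and Keller (1995) and Küchler et al. (1995) proposed returns to
be distributed with a hyperbolic distribution. This occurs when the shape
parameter $\lambda = 1$. The p.d.f. of the returns $r_\Delta(\cdot)$ is

\[ f_{r_\Delta}(x) = \frac{1}{2 \delta \sqrt{\Delta}\, K_1(\bar\alpha)} \exp\left\{ -\bar\alpha \sqrt{1 + \frac{(x - \eta\Delta)^2}{\delta^2 \Delta}} \right\} \qquad (x \in R). \tag{20} \]

The parameter $\bar\alpha$ is the shape parameter for the distribution. The
return distribution for this model has kurtosis
$\kappa_{r_\Delta} = 3 K_1(\bar\alpha) K_3(\bar\alpha) / K_2(\bar\alpha)^2$.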
4.7 The Madan and Seneta log variance gamma model


Madan and Seneta (1990) proposed returns to be distributed with a
normal-gamma mixture distribution. This occurs when the shape parameters
$\lambda > 0$ and $\bar\alpha = 0$, i.e. $\delta = 0$. The p.d.f. of the
returns $r_\Delta(\cdot)$ is

\[ f_{r_\Delta}(x) = \frac{\sqrt{\lambda}}{c \sqrt{\pi \Delta}\; \Gamma(\lambda)\, 2^{\lambda - 1}} \left( \sqrt{2\lambda}\, \frac{|x - \eta\Delta|}{c \sqrt{\Delta}} \right)^{\lambda - \frac{1}{2}} K_{\lambda - \frac{1}{2}}\!\left( \sqrt{2\lambda}\, \frac{|x - \eta\Delta|}{c \sqrt{\Delta}} \right) \qquad (x \in R). \tag{21} \]

The parameter $\lambda$ is the shape parameter for the distribution. The
return distribution for this model has kurtosis
$\kappa_{r_\Delta} = 3(1 + 1/\lambda)$.


5 Analysis of major world market indexes 


The empirical analysis will be performed on market indexes from the United 
States of America, Japan, Germany, Switzerland and Australia. The Aus- 
tralian index is calculated by Datastream International and the other in- 
dexes are calculated by Morgan Stanley Capital International. The data 
for these indexes are daily data for the 15 years from the beginning of 1982 
to the end of 1996, except for 20 years of Australian index data which start 
from the beginning of 1977. We note that all of these indexes include the 
stock market crash of October 1987. 

In Table 1 we display the results of our analysis. Under the heading
of Empirical Model we show the total number of daily returns $n$ and the
sample measure of kurtosis $\hat\kappa_{r_\Delta}$. Also included in this
table is the sample measure of kurtosis $\hat\kappa^*_{r_\Delta}$
corresponding to the data with the largest absolute return removed.

The two sample measures of kurtosis, $\hat\kappa_{r_\Delta}$ and
$\hat\kappa^*_{r_\Delta}$, indicate that the index returns are highly
leptokurtic and hence are very heavy tailed. In fact they are so large that
higher moments (including the fourth moment) may be unbounded for the index
returns, i.e. be infinite. In this case the sample measure of kurtosis
would be unstable. This is what we observe: when one extreme observation is
removed from the sample, the sample kurtosis changes significantly for each
index. We also observe this property in a plot of the sample measure of
kurtosis against the sample size, which is not shown here. It is therefore
important in our analysis to concentrate on the entire distribution and not
just on a single statistic (in particular the possibly unbounded sample
kurtosis) to identify a good or best model, or more precisely a best
marginal distribution. We will therefore base our final judgement on the
likelihood ratio test.
It can be shown that all of the models in Section 4 include the classical
lognormal model as a specific or limiting case. Consequently we can test
if each model is significantly better than the classical lognormal model
by using the likelihood ratio test (see e.g. Rao, 1973, §6e.2). Define the
likelihood ratio

\[ \Lambda = \frac{L_{\text{lognormal}}}{L_{\text{other}}}, \tag{22} \]

where $L_{\text{lognormal}}$ is the likelihood value of the classical
lognormal model and $L_{\text{other}}$ is the likelihood value of the other
model we are testing. The asymptotic distribution of $-2\log\Lambda$ is
chi-square with degrees of freedom equal to the difference in the number of
parameters between the two models. Large values of $-2\log\Lambda$ indicate
that the model under consideration is significantly better (explains more)
than the classical lognormal model. We choose the model with the
significantly¹ largest value of $-2\log\Lambda$ to be the best model.
Intuitively, this is the model which has the largest probability for the
returns and therefore is adding the most information to the classical
lognormal model with the minimum number of parameters.
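
The likelihood ratio statistic is simple to compute from maximised log-likelihood values. The sketch below is illustrative only: the function names and the log-likelihood values are hypothetical, and the chi-square tail probability is implemented just for one and two degrees of freedom, where it has a closed form. These are the two cases that arise when a model adds one or two parameters to the classical lognormal model.

```python
import math

def lr_statistic(loglik_lognormal, loglik_other):
    """-2 log(Lambda) with Lambda = L_lognormal / L_other."""
    return 2.0 * (loglik_other - loglik_lognormal)

def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution (df = 1 or 2 only)."""
    if x <= 0.0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only covers df = 1 and df = 2")

# Hypothetical maximised log-likelihoods, for illustration only.
stat = lr_statistic(-100.0, -95.0)
p_one_extra_parameter = chi2_sf(stat, 1)
p_two_extra_parameters = chi2_sf(stat, 2)
```

For the same value of the statistic the tail probability is larger with two degrees of freedom, which is how an extra parameter is accounted for in the comparison.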

Another way of comparing the models, especially the tail properties, is
to statistically determine how close the empirical distribution function
and the model's distribution function are for the returns. We use the
Anderson and Darling (1952) test here. This test increases the power of the
more commonly used Kolmogorov test in the tails of the distribution by
using the properly weighted test statistic given by

\[ AD = \sup_{x \in R} \frac{|F_e(x) - F_m(x)|}{\sqrt{F_m(x)\,(1 - F_m(x))}}, \tag{23} \]

where $F_e(\cdot)$ is the empirical distribution function and
$F_m(\cdot)$ is the model's distribution function. A good (bad) fit is
indicated by a small (large) difference between the two distribution
functions and hence a small (large) value of the test statistic AD.
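
A weighted supremum statistic of this kind can be evaluated at the jump points of the empirical distribution function, since between jumps the weighted difference is monotone in the model d.f. The following Python sketch makes that assumption explicit and uses a normal model d.f. purely as an example; the function names are ours.

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ad_statistic(sample, model_cdf):
    """Weighted sup-distance between empirical and model d.f. over the sample points."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        fm = model_cdf(x)
        if fm <= 0.0 or fm >= 1.0:
            continue  # weight is zero here; such points are skipped in this sketch
        weight = math.sqrt(fm * (1.0 - fm))
        # Empirical d.f. just before and at the i-th order statistic.
        for fe in (i / n, (i + 1) / n):
            worst = max(worst, abs(fe - fm) / weight)
    return worst

central = ad_statistic([0.0], normal_cdf)
extreme = ad_statistic([5.0], normal_cdf)
```

A single observation far out in the tail already produces a very large value, which illustrates how sensitive the statistic is to extreme returns.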

The parameters for each model of Section 4 are estimated by the maximum
likelihood method. For each model we display in Table 1 the estimated
shape parameter(s), the corresponding kurtosis $\kappa_{r_\Delta}$, the
likelihood ratio test value $-2\log\Lambda$ and the Anderson-Darling test
statistic AD.

For all of the models the likelihood ratio test values $-2\log\Lambda$ are
extremely large, indicating that all of the models from Section 4 are
significantly better than the classical lognormal model. In Table 1 we have
highlighted the most significant likelihood ratio value $-2\log\Lambda$ for
each index. It is clear that the log Student t model is the best model for
all five indexes.

¹ The log symmetric generalised hyperbolic model has one more parameter
than all of the other models, and this has to be accounted for correctly.


Table 1: Results for the quantification of the marginal distributions of
returns and volatility.

Model         Quantity                    (1)          (2)          (3)          (4)          (5)

Empirical     $n$                         5052         3761         3723         3761         3803
              $\hat\kappa_{r_\Delta}$     125.7994     16.7455      21.4201      26.037       93.5112
              $\hat\kappa^*_{r_\Delta}$   9.…          11.…         11.…         17.…         11.…

Normal        $\kappa_{r_\Delta}$         3            3            3            3            3
              AD                          2.…          1.…          4.…          3.…          1.…

Stable        $\alpha$                    1.8055       1.7237       1.6099       1.6931       1.6878
              $\kappa_{r_\Delta}$         $\infty$     $\infty$     $\infty$     $\infty$     $\infty$
              $-2\log\Lambda$             1433.665     828.7447     1008.4832    1132.1436    1180.4462
              AD                          0.045991     0.061149     0.057997     0.060315     0.056682

Clark         $\varphi$                   0.8452       0.9182       1.0643       0.9833       1.0004
              $\kappa_{r_\Delta}$         6.1291       6.9712       9.3127       7.889        8.1616
              $-2\log\Lambda$             1416.0134    806.4511     1034.6007    1101.4388    1229.0101
              AD                          43.745       0.29045      0.22173      0.35477      3.5355

Symmetric     $\lambda$                   -2.2721      -1.9253      -1.4109      -1.753       -1.7799
Generalised   $\bar\alpha$                2.6803e-06   9.3886e-07   0.20115      1.7939e-06   5.8455e-07
Hyperbolic    $\kappa_{r_\Delta}$         14.0143      263.349      19.1107      6581.6199    6135.9287
              $-2\log\Lambda$             1449.143     836.0898     1048.1408    1138.6009    1231.9909
              AD                          0.37820      0.057845     0.054036     0.081052     0.12035

Student t     $\nu$                       4.5441       3.8506       3.0687       3.5061       3.5598
              $\kappa_{r_\Delta}$         14.0268      $\infty$     $\infty$     $\infty$     $\infty$
              $-2\log\Lambda$             1449.143     836.0898     1047.256     1138.6009    1231.9909
              AD                          0.37838      0.057845     0.030761     0.081066     0.12073

Normal-       $\bar\alpha$                0.97355      0.80127      0.52894      0.6519       0.6359
Inverse       $\kappa_{r_\Delta}$         6.0815       6.7441       8.6717       7.602        7.7178
Gaussian      $-2\log\Lambda$             1388.7137    800.6066     1033.4658    1092.0354    1209.8897
              AD                          7808.0       0.71372      0.62053      1.0496       105.02

Hyperbolic    $\bar\alpha$                0.72732      0.57797      0.25477      0.42248      0.28722
              $\kappa_{r_\Delta}$         5.1335       5.3104       5.7425       5.5127       5.6981
              $-2\log\Lambda$             1343.293     755.1277     970.3204     1023.5112    1173.6493
              AD                          6.54e+06     8.5788       26.189       31.885       88645

Variance      $\lambda$                   1.481        1.3375       1.0212       1.1912       1.1225
Gamma         $\kappa_{r_\Delta}$         5.0256       5.243        5.9376       5.5185       5.6726
              $-2\log\Lambda$             1322.8124    739.5928     965.7311     1007.2743    1165.7439
              AD                          3.15e+07     11.613       19.632       36.099       1.00e+05

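
The kurtosis values reported in Table 1 follow from the estimated shape parameters through the closed-form expressions of Section 4. The short Python sketch below (function names are ours) evaluates the expressions that involve only elementary functions, using the first-column shape parameters of Table 1 as inputs.

```python
import math

def kurtosis_student_t(nu):
    """3 (nu - 2) / (nu - 4) for nu > 4, infinite otherwise."""
    return 3.0 * (nu - 2.0) / (nu - 4.0) if nu > 4.0 else math.inf

def kurtosis_normal_inverse_gaussian(alpha_bar):
    return 3.0 * (1.0 + 1.0 / alpha_bar)

def kurtosis_variance_gamma(lam):
    return 3.0 * (1.0 + 1.0 / lam)

def kurtosis_clark(phi):
    return 3.0 * math.exp(phi ** 2)

# Shape parameters taken from the first data column of Table 1.
k_t = kurtosis_student_t(4.5441)
k_nig = kurtosis_normal_inverse_gaussian(0.97355)
k_vg = kurtosis_variance_gamma(1.481)
k_clark = kurtosis_clark(0.8452)
```

Small differences from the tabulated values are to be expected because the shape parameters in the table are rounded.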


Mixed results are obtained when we use the Anderson and Darling test.
Smaller AD values indicate a better fit. Depending on the index, the best
model under this criterion is either the log Student t model or the log
stable model. We note that the October 1987 crash return has a great
influence on the test statistic, which therefore seems to be biased towards
heavy tailed distributions. We consider this test only as an additional
check for identifying the appropriate tail properties. From our point of
view the likelihood ratio test provides the most objective basis for a
comparison between the marginal return distributions.
Based on the above results we conclude that the three parameter Student
t distribution is the best marginal distribution for index returns. It is
closely followed by the four parameter symmetric generalised hyperbolic
distribution, which for all but the Japanese index turns out to be exactly
the Student t distribution. The stable, Clark and normal-inverse Gaussian
distributions do not explain the tail properties accurately enough. The
stable distribution overestimates the tail thickness whereas the Clark and
normal-inverse Gaussian distributions both underestimate it. The hyperbolic
and variance gamma distributions are poor and dramatically underestimate
the tail thickness.

Some readers may argue that for each index the large negative return
caused by the stock market crash of October 1987 is an outlier and is
therefore influencing our results, that is, favouring a model with very
heavy tails over one with less heavy tails. We would like to point out that
in the paper on extreme value theory for asset price returns by Longin
(1996) the view that the stock market crash of October 1987 is an outlier
is dismissed, i.e. the crash is consistent with the rest of the data. As a
matter of principle we do not like to exclude extreme events such as stock
market crashes from our samples, because it is precisely this feature
(namely tail heaviness) that we aim to model in a consistent way. However,
to provide a view on the robustness of our results we removed the large
negative return caused by the stock market crash of October 1987 and
repeated our study. As is to be expected when extreme events are removed,
the distributions with thinner tails improve whereas the stable
distribution, which has the heaviest tails, gets worse (the results can be
obtained from the authors). The Student t distribution is still the best
distribution from the alternatives considered.


6 Conclusion 


The Student t distribution has been shown above to be the best marginal
distribution for index returns, with respect to the likelihood ratio test.
It implies an inverted gamma distribution for the marginal distribution of
the squared volatility $\sigma^2$. This distributional property can now be
exploited to identify possible dynamics of the volatility process $\sigma$
and hence the evolution of the asset price process.
1 Introduction 


$L_1$ estimation provides a somewhat robust alternative to least squares
estimation for autoregressive models. Define a $p$-th order autoregressive
(AR($p$)) process

\[ Y_t = \phi_0 + \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + \varepsilon_t \tag{1} \]

where $\{\varepsilon_t\}$ are independent, identically distributed (i.i.d.)
random variables such that (a) $E(\varepsilon_t^2) < \infty$; (b)
$\varepsilon_t$ has median 0; (c) $F(x) = P(\varepsilon_t \le x)$ is
continuous at $x = 0$.

We will assume that the process $\{Y_t\}$ is stationary; for this, we
require that

\[ \sum_{k=1}^{p} \phi_k z^k \ne 1 \]

for all complex $z$ with modulus $|z| \le 1$. Throughout this paper, we
will assume the model (1) with the intercept $\phi_0$; however, all of the
results given in the paper will go through (with appropriate modifications)
if $\phi_0$ is suppressed and only $\phi_1, \ldots, \phi_p$ are estimated.

Least squares (or some related method) is typically used to estimate the
parameters in the model (1). However, when the $\varepsilon_t$'s have heavy
tails, least squares is inefficient compared to some other methods; one
such method is $L_1$-estimation. We define $L_1$-estimators,
$\hat\phi_0, \hat\phi_1, \ldots, \hat\phi_p$, to minimize the objective
function

\[ g(v_0, v_1, \ldots, v_p) = \sum_{t=1}^{n} |Y_t - v_0 - v_1 Y_{t-1} - \cdots - v_p Y_{t-p}|. \tag{2} \]

(This assumes that we have $n + p$ observations but asymptotically has no
effect.) It is well-known (see Pollard, 1991; Wang and Wang, 1996) that
the asymptotic behaviour of $L_1$ estimators depends on the behaviour of
the distribution function $F(x)$ for $x$ close to 0. For example, if
$F'(0) = \lambda > 0$ then we have

\[ \sqrt{n}\,(\hat\phi_n - \phi) \to_d N_{p+1}\!\left( 0,\ C^{-1}/(4\lambda^2) \right) \quad \text{as } n \to \infty \]

where $C$ is a $(p+1) \times (p+1)$ matrix defined to be

\[ C = E[X_t X_t^T] \tag{3} \]

where $X_t = (1, Y_{t-1}, \ldots, Y_{t-p})^T$. Note that, contrary to
popular belief, it is not necessary for $F$ to be absolutely continuous to
have asymptotic normality.
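
For a small AR(1) example the minimizer of (2) can be found by brute force. The Python sketch below relies on the standard fact that a least absolute deviations fit with two parameters can be chosen to pass through two of the data points, so it is enough to search over all pairs. This is an illustration for short series only, not an efficient algorithm, and the function names and data are hypothetical.

```python
from itertools import combinations

def l1_objective(y, v0, v1):
    """Objective (2) for p = 1: sum of |Y_t - v0 - v1 * Y_{t-1}|."""
    return sum(abs(y[t] - v0 - v1 * y[t - 1]) for t in range(1, len(y)))

def l1_ar1_fit(y):
    """Exhaustive search over lines through pairs of points (Y_{t-1}, Y_t)."""
    points = [(y[t - 1], y[t]) for t in range(1, len(y))]
    best = None
    for (x1, z1), (x2, z2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, not a valid regression fit
        v1 = (z2 - z1) / (x2 - x1)
        v0 = z1 - v1 * x1
        value = l1_objective(y, v0, v1)
        if best is None or value < best[0]:
            best = (value, v0, v1)
    return best[1], best[2]

# Noise-free series Y_t = 1 + 0.5 * Y_{t-1}: the fit recovers (1, 0.5).
clean = [0.0, 1.0, 1.5, 1.75, 1.875, 1.9375]
fit_clean = l1_ar1_fit(clean)

# Same series with one hypothetical gross error in the fourth observation.
contaminated = [0.0, 1.0, 1.5, 10.0, 1.875, 1.9375]
fit_contaminated = l1_ar1_fit(contaminated)
```

The search is cubic in the number of observations, so in practice one would use a linear programming or interior point routine instead.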

The assumption that $F'(0) = \lambda > 0$ is quite strong in the sense that
it is difficult to verify; given even very large samples, it is difficult
to distinguish between a density which is finite at 0 and one which has a
singularity at 0. For the sample median (which is the $L_1$-estimator of
location), it has been shown (for example, by de Haan and Taconis-Haantjes,
1979) that the rate of convergence depends on the behaviour of $F(x)$ for
$x$ close to the population median; see also Smirnov (1952) who derives the
domains of attraction for sample quantiles. Similarly, it is not necessary
for $F$ to be differentiable in order to find a limiting distribution for
the $L_1$-estimator $\hat\phi_n$. We will assume that for some sequence
$\{a_n\}$ with $a_n \to \infty$, there exists a strictly increasing
function $\psi$ such that

\[ \sqrt{n}\,\left( F(t/a_n) - F(0) \right) = \psi(t) + r_n(t) \tag{4} \]

where $r_n(t) \to 0$ as $n \to \infty$ for each $t$. Also define

\[ \Psi(t) = \int_0^t \psi(s)\, ds \tag{5} \]

and

\[ R_n(t) = \int_0^t r_n(s)\, ds. \tag{6} \]

Note that since $\psi(t)$ is strictly increasing, $\Psi(t)$ will be
strictly convex.

The formulation given above provides a great deal of flexibility. For
example, suppose that

\[ F(x) - F(0) = \lambda\, \mathrm{sgn}(x)\, |x|^{\alpha} L(|x|) \]

for $x$ in a neighbourhood of 0 where $\alpha > 0$ and $L$ is a slowly
varying function at 0. ($\mathrm{sgn}(x) = 1$ if $x$ is positive and $-1$
if $x$ is negative.) In this case, we can take

\[ a_n = n^{1/(2\alpha)} L^*(n) \quad \text{and} \quad \psi(t) = \lambda\, \mathrm{sgn}(t)\, |t|^{\alpha} \]

where $L^*$ is a slowly varying function at infinity. When $F(x)$ is
differentiable at $x = 0$ with $F'(0) = \lambda > 0$ then
$\psi(t) = \lambda t$ and $a_n = \sqrt{n}$; however, if
$F(x) - F(0) = \lambda x \ln(|x|^{-1})$ for $x$ close to 0 then
$\psi(t) = \lambda t$ with $a_n = \sqrt{n}\,\ln(n)/2$. If $F(x)$ is not
differentiable at $x = 0$ but has positive one-sided derivatives
$\lambda^+$ and $\lambda^-$ then

\[ \psi(t) = \begin{cases} \lambda^+ t & \text{for } t \ge 0, \\ \lambda^- t & \text{for } t < 0. \end{cases} \]

(This occurs, for example, if the density has a jump at 0.)
In Section 2, we will determine the limiting distribution of the
$L_1$-estimator under the general conditions on $F$ described above. We
define

\[ Z_n(u) = \frac{a_n}{\sqrt{n}} \sum_{t=1}^{n} \left[ \left| \varepsilon_t - X_t^T u / a_n \right| - |\varepsilon_t| \right]. \tag{7} \]

Note that $Z_n$ is minimized at $u = a_n(\hat\phi_n - \phi)$. $Z_n$ is a
convex function and hence if the finite dimensional distributions converge
weakly to those of a convex function $Z$, it follows that

\[ a_n(\hat\phi_n - \phi) \to_d \mathrm{argmin}(Z) \]

provided $\mathrm{argmin}(Z)$ is almost surely unique (Geyer, 1996). What
is interesting is that only finite dimensional weak convergence is needed
and not any sort of functional weak convergence (although this is implied
by the finite dimensional convergence for convex functions).

In Section 3, we will obtain an "in distribution" Bahadur-Kiefer
representation for the $L_1$-estimator under the general conditions
described above. This will be done by approximating $Z_n$ by an appropriate
function $Z_n^*$ and looking at the limiting behaviour of
$n^{1/4}(Z_n - Z_n^*)$.
2 Limiting distributions 


In order to derive the limiting distributions of the $L_1$ estimators, we
will assume the following regularity conditions.

(A1) $\{\varepsilon_t\}$ are 0 median, finite variance i.i.d. random
variables with distribution function $F$ satisfying (4) for some $\psi(t)$
and $r_n(t)$.

(A2) For each $u$,

\[ E[\Psi(u^T X_t)] = \tau(u) < \infty \]

where $\Psi(t)$ is defined in (5) and $\tau(u)$ is a strictly convex
function.

(A3) For each $u$,

\[ \frac{1}{n} \sum_{t=1}^{n} R_n(u^T X_t) \to_p 0 \]

as $n \to \infty$ where $R_n(t)$ is defined in (6).

Note that condition (A1) implies that $E[(u^T X_t)^2] < \infty$; thus,
depending on the exact form of $\Psi$, condition (A2) may be implied by
(A1). A sufficient condition for (A3) is $E[|R_n(u^T X_t)|] \to 0$.


Theorem 1 Suppose that {Y;} is an AR(p) process satisfying (1) and that 
Zn(u) is as defined in (7). Then under conditions (A1), (A2) and (A3), 


(Zn(u1), ee Zn(Uk)) >d (Z(u1), an Z(ux)) 


as n — oo where 


Z(u) = ul W +2r(u) 
with W a(p+1)-variate Normal random vector with mean O and covariance 
matrix C defined in (8). 
Proof: We will use the identity
$$|x - y| - |x| = y\,[I(x < 0) - I(x > 0)] + 2 \int_0^y [I(x \le s) - I(x \le 0)]\, ds$$
which is valid for $x \ne 0$. ($I(A)$ is the indicator function of the set $A$.) Now
$$Z_n(u) = Z_n^{(1)}(u) + Z_n^{(2)}(u)$$
where
$$Z_n^{(1)}(u) = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} X_t^T u\, [I(\varepsilon_t < 0) - I(\varepsilon_t > 0)]$$
and
$$Z_n^{(2)}(u) = \frac{2 a_n}{\sqrt{n}} \sum_{t=1}^{n} \int_0^{v_{nt}} [I(\varepsilon_t \le s) - I(\varepsilon_t \le 0)]\, ds$$
(with $v_{nt} = X_t^T u / a_n$). Since, for each $u$, the summands in $Z_n^{(1)}(u)$ are stationary martingale differences with finite variance, it follows from a martingale central limit theorem that

$$Z_n^{(1)}(u) \to_d u^T W$$

and the convergence in distribution holds for any finite collection of $u$'s.
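For $Z_n^{(2)}(u)$, we have

$$Z_n^{(2)}(u) = E(Z_n^{(2)}(u)) + \left( Z_n^{(2)}(u) - E(Z_n^{(2)}(u)) \right).$$
Letting $v_{nt} = X_t^T u / a_n$, it follows that
$$E(Z_n^{(2)}(u)) = \frac{2 a_n}{\sqrt{n}} \sum_{t=1}^{n} \int_0^{v_{nt}} [F(s) - F(0)]\, ds$$
$$= \frac{2}{n} \sum_{t=1}^{n} \int_0^{X_t^T u} \sqrt{n}\, (F(s/a_n) - F(0))\, ds$$
$$= \frac{2}{n} \sum_{t=1}^{n} \left[ \Psi(u^T X_t) + R_n(u^T X_t) \right]$$
$$\to_p 2\tau(u)$$

where $R_n$ is defined in (6). For the remainder term in $Z_n^{(2)}(u)$, we have (since the summands are again martingale differences)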


Var(Z2)(u)) = no en 


IA 


4 z(u 
max Sen 
{XT u} is stationary with finite second moment and so 


max |X} u| —p 0. 
1<t<n 
Thus
$$Z_n^{(2)}(u) - E(Z_n^{(2)}(u)) \to_p 0 \quad \text{as } n \to \infty$$

and so $Z_n^{(2)}(u) \to_p 2\tau(u)$. Thus we have
$$Z_n(u) \to_d u^T W + 2\tau(u) = Z(u)$$

and the finite dimensional convergence holds trivially. $\Box$
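
The identity used at the start of the proof, $|x - y| - |x| = y[I(x<0) - I(x>0)] + 2\int_0^y [I(x \le s) - I(x \le 0)]\,ds$ for $x \ne 0$, can be checked numerically. The following is only an illustrative sketch: the function names, the midpoint rule and the sampling range are choices made here, not part of the paper.

```python
import random


def indicator(condition):
    """Return 1.0 if the condition holds and 0.0 otherwise."""
    return 1.0 if condition else 0.0


def lhs(x, y):
    """Left-hand side of the identity: |x - y| - |x|."""
    return abs(x - y) - abs(x)


def rhs(x, y, steps=2000):
    """Right-hand side, with the integral from 0 to y approximated
    by a midpoint rule (the step h is signed, so y < 0 is handled)."""
    linear = y * (indicator(x < 0) - indicator(x > 0))
    h = y / steps
    integral = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        integral += (indicator(x <= s) - indicator(x <= 0)) * h
    return linear + 2.0 * integral


def max_identity_error(trials=100, seed=0):
    """Largest absolute gap between the two sides over random (x, y)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = rng.uniform(-3.0, 3.0)
        y = rng.uniform(-3.0, 3.0)
        if x == 0.0:
            continue  # the identity is stated for x != 0 only
        worst = max(worst, abs(lhs(x, y) - rhs(x, y)))
    return worst
```

The gap is bounded by the discretisation error of the integral (about two grid cells), so it shrinks as `steps` grows.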


The following corollary gives us a representation of the limiting distribution of $a_n(\hat\phi_n - \phi)$.


Corollary 2 Let $\hat\phi_n$ minimize (2). Under the assumptions of Theorem 1,

$$a_n(\hat\phi_n - \phi) \to_d \operatorname{argmin}(Z)$$

as $n \to \infty$.


Proof: Since $\tau$ is strictly convex, $Z$ is strictly convex and so has a unique minimum. The result follows from Geyer (1996). $\Box$


The limiting distribution given in Corollary 2 will not be normal unless the function $\tau(u)$ is quadratic. In the following example, we illustrate the computation of the limiting distribution in a special case.


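Example 1 Consider the AR(1) process
$$Y_t = \phi_0 + \phi_1 Y_{t-1} + \varepsilon_t$$
where the $\varepsilon_t$'s are i.i.d. random variables with density

$$f_\alpha(x) = \frac{|x|^{\alpha - 1} \exp(-|x|)}{2\Gamma(\alpha)}$$

for some $\alpha > 0$. (This is a two-sided Gamma distribution.) For $x$ close to 0, we have
$$F_\alpha(x) - F_\alpha(0) \approx \frac{\operatorname{sgn}(x)\,|x|^\alpha \exp(-|x|)}{2\Gamma(\alpha + 1)}$$

and so setting $a_n = n^{1/(2\alpha)}$, we get

$$\sqrt{n}\,(F_\alpha(t/a_n) - F_\alpha(0)) \to \psi_\alpha(t) = \operatorname{sgn}(t)\,|t|^\alpha\, \frac{1}{2\Gamma(\alpha + 1)}$$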
with 


lot 
|Vii(Fa(t/4n) — Fa(0)) ~ Yalt)| < Ha) ray. 
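It is easy to verify that conditions (A1), (A2) and (A3) are all satisfied (since $E[|Y_t|^r] < \infty$ for all $r > 0$) and so for given $\phi_0$, $\phi_1$, the limiting objective function in Theorem 1 is

$$Z_\alpha(u_0, u_1) = u_0 W_0 + u_1 W_1 + \frac{1}{\Gamma(\alpha + 2)}\, E\left[ |u_0 + u_1 Y_1|^{\alpha + 1} \right]$$

where $W_0$ and $W_1$ are zero mean Normal random variables with $\mathrm{Var}(W_0) = 1$, $\mathrm{Var}(W_1) = E(Y_1^2)$ and $\mathrm{Cov}(W_0, W_1) = E(Y_1)$. By differentiation, we determine the minimizers of $Z_\alpha$, $U_0$ and $U_1$, to satisfy the equations

$$W_0 + \frac{1}{\Gamma(\alpha + 1)}\, d_0(U_0, U_1) = 0$$
$$W_1 + \frac{1}{\Gamma(\alpha + 1)}\, d_1(U_0, U_1) = 0$$

where

$$d_0(u_0, u_1) = E\left[ \operatorname{sgn}(u_0 + u_1 Y_1)\, |u_0 + u_1 Y_1|^\alpha \right] \quad \text{and}$$
$$d_1(u_0, u_1) = E\left[ \operatorname{sgn}(u_0 + u_1 Y_1)\, Y_1\, |u_0 + u_1 Y_1|^\alpha \right].$$

($\operatorname{sgn}(x) = 1$ or $-1$ depending on whether $x$ is positive or negative.) If $f_W(w_0, w_1)$ is the joint density of $(W_0, W_1)$, it then follows that the joint density of $(U_0, U_1)$ (that is, the limiting density of $a_n(\hat\phi_n - \phi)$) is

$$f_U(u_0, u_1) = \frac{1}{\Gamma^2(\alpha)}\, f_W\!\left( -\frac{d_0(u_0, u_1)}{\Gamma(\alpha + 1)},\, -\frac{d_1(u_0, u_1)}{\Gamma(\alpha + 1)} \right) \left( d_{00}(u_0, u_1)\, d_{11}(u_0, u_1) - d_{10}^2(u_0, u_1) \right)$$

where

$$d_{00}(u_0, u_1) = E\left[ |u_0 + u_1 Y_1|^{\alpha - 1} \right],$$
$$d_{11}(u_0, u_1) = E\left[ Y_1^2\, |u_0 + u_1 Y_1|^{\alpha - 1} \right] \quad \text{and}$$
$$d_{10}(u_0, u_1) = E\left[ Y_1\, |u_0 + u_1 Y_1|^{\alpha - 1} \right].$$

The density $f_U$ cannot easily be computed analytically (unless $\phi_1 = 0$) but can be computed feasibly using Monte Carlo sampling.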
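
As a rough illustration of the Monte Carlo idea, the sketch below treats a deliberately simplified case that is not the bivariate computation of the example: it assumes an intercept-only model, so that the limiting objective reduces to $Z(u) = uW + |u|^{\alpha+1}/\Gamma(\alpha+2)$ with $W$ standard Normal. Setting the derivative to zero gives $U = -\operatorname{sgn}(W)\,(\Gamma(\alpha+1)|W|)^{1/\alpha}$. The function names and the default number of draws are hypothetical.

```python
import math
import random


def limiting_minimizer(w, alpha):
    """Minimizer of u*w + |u|**(alpha+1) / Gamma(alpha+2) over u.

    Solves w + sgn(u) * |u|**alpha / Gamma(alpha+1) = 0 in closed form.
    """
    magnitude = (math.gamma(alpha + 1.0) * abs(w)) ** (1.0 / alpha)
    return -math.copysign(magnitude, w)


def sample_limit(alpha, draws=10000, seed=1):
    """Draw from the limiting distribution in the intercept-only case
    (an assumption made for this sketch) by sampling W ~ N(0, 1)."""
    rng = random.Random(seed)
    return [limiting_minimizer(rng.gauss(0.0, 1.0), alpha)
            for _ in range(draws)]
```

For $\alpha = 1$ the minimizer is simply $-W$, so the draws are Normal; for other values of $\alpha$ a histogram of `sample_limit(alpha)` shows the non-Normal shape.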


3 Second order properties 


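It follows from the proof of Theorem 1 that we can approximate $Z_n$ by the function

$$Z_n^*(u) = -\frac{1}{\sqrt{n}} \sum_{t=1}^{n} X_t^T u\, [I(\varepsilon_t > 0) - I(\varepsilon_t < 0)] + 2\tau(u). \qquad (8)$$

It is easy to see that we can approximate $a_n(\hat\phi_n - \phi)$ by the minimizer of $Z_n^*$. For example, if $\psi(t) = \lambda t$ (for some $\lambda > 0$) and $a_n = \sqrt{n}$, it follows that $\tau(u) = \lambda u^T C u / 2$ and so we can approximate $\sqrt{n}(\hat\phi_n - \phi)$ by

$$\frac{1}{2\lambda\sqrt{n}}\, C^{-1} \sum_{t=1}^{n} X_t\, [I(\varepsilon_t > 0) - I(\varepsilon_t < 0)].$$

More generally, we have

$$a_n(\hat\phi_n - \phi) \approx h^{-1}(W_n/2)$$
where $h(u)$ is the gradient of $\tau(u)$, $h^{-1}$ its inverse and

$$W_n = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} X_t\, [I(\varepsilon_t > 0) - I(\varepsilon_t < 0)]. \qquad (9)$$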


(Typically, $h(u) = E[X_t \psi(u^T X_t)]$.) Theorems which deal with the asymptotic behaviour of this approximation error are commonly known as Bahadur-Kiefer theorems due to their connection with the work of Bahadur (1966) and Kiefer (1967) for sample quantiles. What will be proved below is an “in distribution” (as opposed to “almost sure”) Bahadur-Kiefer theorem. The following lemma will be useful in determining the asymptotic behaviour of the approximation error.


Lemma 3 Define
$$g_n(u) = -x_n^T u + \rho_n(u), \qquad h_n(u) = -x_n^T u + \rho(u)$$

and let $u_n = \operatorname{argmin}(g_n)$, $v_n = \operatorname{argmin}(h_n)$. Suppose that

(i) $x_n \to x_0$;

(ii) $u_n - v_n \to 0$;

(iii) for any $t$, $u$ and $w$,

$$\rho_n(u + tw) - \rho_n(u) = \int_0^t w^T \psi_n(u + sw)\, ds$$

and
$$\rho(u + tw) - \rho(u) = \int_0^t w^T \psi(u + sw)\, ds$$

for some functions $\{\psi_n\}$ and $\psi$ where $\psi$ is one-to-one;

(iv) $v_0 = \psi^{-1}(x_0)$ exists and for some $\alpha > 0$

$$\|\psi(u) - \psi(v)\| \le k \|u - v\|^\alpha$$
for all $u$, $v$ in a neighbourhood of $v_0$;

(v) for some sequence $\{b_n\}$ with $b_n \to \infty$ and any compact set $K$,
$$\sup_{u \in K} \| b_n(\psi_n(u) - \psi(u)) - d_0(u) \| \to 0$$

where $d_0$ is a continuous function.

Then
$$b_n(\psi(u_n) - \psi(v_n)) \to -d_0(v_0)$$

where $v_0 = \psi^{-1}(x_0)$.


A proof of Lemma 3 will not be given here. Note that if the function $\psi(u)$ has continuous partial derivatives at $u = v_0$ then under the conditions of Lemma 3 we have

$$b_n(u_n - v_n) \to -H_\rho^{-1}(v_0)\, d_0(v_0)$$

provided $H_\rho(u)$, the Hessian of $\rho$, is invertible at $u = v_0$.

Lemma 3 will be applied to sequences of random elements by appealing to Skorokhod-type arguments (see, for example, van der Vaart and Wellner, 1996) to construct almost surely convergent sequences. To do this, we will define a space $B_m(R^d)$ of locally bounded $R^m$-valued functions defined on $R^d$. (By “locally bounded” we mean bounded on compact sets.) If $\{g_n\}$ and $g$ are elements of $B_m(R^d)$ then we will say that $\{g_n\}$ converges to $g$ if

$$\sup_{u \in K} \| g_n(u) - g(u) \| \to 0$$

for all compact subsets $K$ of $R^d$. A possible metric for this topology is

$$d(g, h) = \sum_{k=1}^{\infty} \min(1, d_k(g, h))\, 2^{-k}$$

where
$$d_k(g, h) = \sup_{\|u\| \le k} \| g(u) - h(u) \|.$$

We also define $C_m(R^d)$ to be the space of $R^m$-valued continuous functions on $R^d$; $C_m(R^d)$ is a separable subset of $B_m(R^d)$. If $\{D_n\}$ and $D$ are random elements of $B_m(R^d)$ such that $D_n \to_d D$ and $D$ is (with probability 1) a random element of $C_m(R^d)$ then it is possible to find almost surely convergent representations of $\{D_n\}$ and $D$.

(A4) For each compact set $K$,
$$\sup_{u \in K} n^{1/4} \left\| E[X_t\, r_n(u^T X_t)] \right\| \to 0$$

as $n \to \infty$.


(A5) For each compact set K, we have 


pO 


sup 


IDAT Kerli X)| => 
uck 


as Nn — CO. 


(A6) For each $u$, $E[X_t^T X_t\, |\psi(u^T X_t)|]$ is finite.


Theorem 4 Suppose that $\{Y_t\}$ is an AR(p) process satisfying (1) with $Z_n$ and $Z_n^*$ defined as in (7) and (8). Then under conditions (A1)-(A6), we have

$$n^{1/4}(Z_n(u) - Z_n^*(u)) \to_d V(u) \quad \text{as } n \to \infty$$

on $C_1(R^{p+1})$ where
$$V(u + tw) - V(u) = 2 \int_0^t w^T D(u + sw)\, ds$$

and $D(u)$ is a zero mean Gaussian process with $D(0) = 0$ and

$$E\left[ (D(u) - D(v))(D(u) - D(v))^T \right] = E\left[ X_t X_t^T\, |\psi(u^T X_t) - \psi(v^T X_t)| \right].$$


Proof: Define
$$V_n(u) = n^{1/4}(Z_n(u) - Z_n^*(u))$$

and note that $V_n(0) = 0$ for all $n$. We also have

$$V_n(u + tw) - V_n(u) = 2 \int_0^t w^T D_n(u + sw)\, ds$$

where

$$D_n(u) = n^{-1/4} \sum_{t=1}^{n} X_t \left[ I(\varepsilon_t \le u^T X_t / a_n) - I(\varepsilon_t \le 0) \right] - n^{1/4}\, E[X_t \psi(u^T X_t)]$$
since our assumptions imply that the gradient of $\tau(u) = E[\Psi(u^T X_t)]$ is $E[X_t \psi(u^T X_t)]$. Clearly, $D_n(0) = 0$ for all $n$ and applying an appropriate martingale central limit theorem (Hall and Heyde, 1980), it follows that the finite dimensional distributions of $D_n$ converge to those of $D$. It is also straightforward to verify that on each compact set $K$,

$$\lim_{\delta \downarrow 0} \limsup_{n \to \infty} P\left[ \sup_{\|u - v\| \le \delta;\ u, v \in K} \| D_n(u) - D_n(v) \| > \epsilon \right] = 0$$

for every $\epsilon > 0$ by using an appropriate moment condition. Hence $D_n \to_d D$ on $B_{p+1}(R^{p+1})$ and the conclusion follows. $\Box$


Theorem 5 Assume the conditions of Theorem 4 and let $h(u)$ be the gradient of $\tau(u)$ with inverse $h^{-1}$. If $U$ minimizes $Z$ and

$$\| h(u) - h(v) \| \le k \| u - v \|^\alpha \quad (\alpha > 0)$$

for all $u$, $v$ in a neighbourhood of $U$ ($k$ and $\alpha$ may depend on $U$) then

$$n^{1/4} \left( h(a_n(\hat\phi_n - \phi)) - \tfrac{1}{2} W_n \right) \to_d -D(h^{-1}(W/2))$$

as $n \to \infty$ where $W_n$ is defined in (9), $D$ is the Gaussian process defined in Theorem 4 and $W$ is a $(p+1)$-variate Normal random vector (independent of $D$) with mean 0 and covariance matrix $C$.


Proof: Let $U_n = a_n(\hat\phi_n - \phi)$ minimize $Z_n$. Then it is easy to verify that
$$(U_n,\, n^{1/4}(Z_n - Z_n^*)) \to_d (h^{-1}(W/2),\, V)$$

as $n \to \infty$ on the space $R^{p+1} \times B_1(R^{p+1})$ where $W$ and $V$ are independent. Since the limit is concentrated on a separable subset of $R^{p+1} \times B_1(R^{p+1})$ (namely $R^{p+1} \times C_1(R^{p+1})$), we can construct a probability space and almost surely convergent versions of $\{U_n\}$ and $\{n^{1/4}(Z_n - Z_n^*)\}$. The conclusion follows by applying Lemma 3 to each convergent sequence. $\Box$

If $h(u)$ is one-to-one (with inverse $h^{-1}$) and continuously differentiable then it follows that (under the conditions of Theorem 5)

$$n^{1/4} \left[ a_n(\hat\phi_n - \phi) - h^{-1}(W_n/2) \right] \to_d -H^{-1}(h^{-1}(W/2))\, D(h^{-1}(W/2))$$

provided that $H(u)$, the Hessian of $\tau$, is invertible outside of a set of Lebesgue measure 0 in $R^{p+1}$. This suggests the asymptotic expansion

$$a_n(\hat\phi_n - \phi) = h^{-1}(W_n/2) - n^{-1/4}\, H^{-1}(h^{-1}(W_n/2))\, D(h^{-1}(W_n/2)) + o_p(n^{-1/4}).$$

(As before, $W_n$ is defined as in (9).) Whether this expansion is particularly useful is an open question.

Evaluating the limiting distribution in Theorem 5 is tedious but not overly difficult (provided, of course, that everything about $\{Y_t\}$ is known). For a given $u$, $D(u)$ is $(p+1)$-variate Normal with mean 0 and covariance matrix

$$K(u) = E[X_t X_t^T\, |\psi(u^T X_t)|].$$

If $K(u)$ is positive definite for $u$ outside of a set of Lebesgue measure 0 then since $W$ is independent of $D$, it follows that the density of $-D(h^{-1}(W/2))$ is

$$f_1(x) = \frac{1}{\pi^{p+1} |C|^{1/2}} \int \frac{|H(u)|}{|K(u)|^{1/2}} \exp\left[ -\tfrac{1}{2} \gamma_1(x, u) \right] du$$

where

$$\gamma_1(x, u) = x^T K^{-1}(u)\, x + 4 h(u)^T C^{-1} h(u)$$

and the integration is over $R^{p+1}$ with $|\cdot|$ denoting determinant. Likewise the density of $-H^{-1}(h^{-1}(W/2))\, D(h^{-1}(W/2))$ is

$$f_2(x) = \frac{1}{\pi^{p+1} |C|^{1/2}} \int \frac{|H(u)|^2}{|K(u)|^{1/2}} \exp\left[ -\tfrac{1}{2} \gamma_2(x, u) \right] du$$

where
$$\gamma_2(x, u) = x^T H(u) K^{-1}(u) H(u)\, x + 4 h(u)^T C^{-1} h(u).$$

In the following example, we derive the density $f_2$ in a simple case.


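Example 2. Let $Y_t = \varepsilon_t$ where $\{\varepsilon_t\}$ are i.i.d. random variables with distribution function $F$ satisfying $F'(0) = \lambda > 0$. Suppose that we estimate only the parameter $\phi_1$ of an AR(1) model; call this estimator $\hat\phi_n$. We then have

$$\tau(u) = \tfrac{1}{2} \lambda \sigma^2 u^2$$

where $\sigma^2 = E(\varepsilon_t^2)$. We also have $C = \sigma^2$ and $H(u) = \tau''(u) = \lambda\sigma^2$. Finally, $K(u) = \lambda\gamma|u|$ where $\gamma = E[|\varepsilon_t|^3]$. It follows from Theorems 4 and 5 that

$$n^{1/4} \left[ \sqrt{n}\, \hat\phi_n - \frac{1}{2\lambda\sigma^2\sqrt{n}} \sum_{t=1}^{n} Y_{t-1} \left( I(\varepsilon_t > 0) - I(\varepsilon_t < 0) \right) \right] \to_d S$$
where $S$ has density

$$f_S(x) = \frac{\lambda^{3/2} \sigma^3}{\pi \gamma^{1/2}} \int_{-\infty}^{\infty} |u|^{-1/2} \exp\left[ -\frac{1}{2} \left( \frac{\lambda \sigma^4 x^2}{\gamma |u|} + 4\lambda^2\sigma^2 u^2 \right) \right] du.$$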
$S$ has a symmetric distribution with moment generating function

$$m(t) = 2 \exp\left( \frac{\gamma^2 t^4}{32 \lambda^4 \sigma^{10}} \right) \Phi\left( \frac{\gamma t^2}{4 \lambda^2 \sigma^5} \right)$$
where $\Phi$ is the standard Normal distribution function.


4 Final comments 


In this paper, we have derived first- and second-order limiting distributions for the $L_1$-estimators of the parameters of an AR(p) process under fairly general conditions on the error distribution. From a statistical point of view, the fact that the asymptotic behaviour of the $L_1$-estimators is so sensitive to the behaviour of $F(x)$ for $x$ close to 0 is somewhat troubling. One possible non-parametric approach to estimating the sampling distribution of $\hat\phi_n$ is to bootstrap the AR(p) process by sampling with replacement from the residuals $e_t = Y_t - X_t^T \hat\phi_n$. However, it is possible to show that, asymptotically, this bootstrap procedure is correct to first order only if $\psi(t)$ is a linear function and is never correct to second order. This is similar to the results of Hall and Martin (1988) and Huang et al. (1996) for sample quantiles of i.i.d. random variables. However, other approaches to bootstrapping time series, such as frequency domain bootstrapping, may prove to be more fruitful in this problem.

It may also be possible to exploit Lemma 3 to obtain an “almost sure” 
Bahadur-Kiefer representation. Arcones (1996a, 1996b) and He and Shao 
(1996) derive such representations for Lp-estimators in linear regression 
models. However, these papers assume that F(x), the distribution function 
of the errors, is linear in a neighbourhood of x = 0. Using the notation 
of Theorems 4 and 5, we can conjecture that, under appropriate regularity 
conditions, 


(n/In(In(n)))*/4 (Pen, - ¢)) - ET) =O(1) (10) 


with probability 1 where $b_n$ satisfies the condition

$$\lim_{n \to \infty} \sqrt{n / \ln(\ln(n))}\, \left( F(t/b_n) - F(0) \right) = \psi(t)$$

and the set of limit points of the left hand side of (10) is non-trivial.
Abstract: Testing for first-order auto-regressive errors in a linear regression model is considered. It is found that the $L_1$-norm based Lagrange multiplier test avoids computational difficulties caused by the dependency among the errors. Furthermore, the Lagrange multiplier test has the advantage that estimation of the error term density at zero is not required. As the error term variance increases and the error term density at zero becomes larger the asymptotic relative efficiency becomes more favorable for the $L_1$-norm based test relative to the corresponding least squares test.


Key words: Auto-regressive errors, $L_1$-norm estimation, Lagrange multiplier test.

1 Introduction


The existence of serially dependent error terms when a linear regression 
model is used for analyzing time series data has attracted the attention of 
recent research. Elimination of estimation problems caused by the depen- 
dence involves both the detection of the dependence and the selection of an 
appropriate estimation technique for the model in question if dependence 
is found. 

It is also well recognized in the literature that many data sets contain outliers or, alternatively, are well represented by distributions with fat tails. This has motivated the introduction of robust estimators, including the $L_1$-norm estimator. Simulation experiments of $L_1$-norm based estimators of models with serially dependent error terms are reported in e.g. Coursey and Nyquist (1983 and 1986). These experiments indicate that $L_1$-norm


330 Hans Nyquist 


methods for estimation are preferable to least squares methods when the tails of the error distribution become fatter. The experiments also indicate that estimators that take account of the serial dependence outperform those estimators that ignore it.

Modelling for serially dependent errors results in a non-linear model, implying that standard techniques for computing estimates do not apply. One procedure for least squares estimation of linear regression models with serially dependent errors is that of Cochrane and Orcutt (1949). The Cochrane-Orcutt approach was extended to estimation using a more general criterion function in Coursey and Nyquist (1986) and further analyzed in Nyquist (1992). These analyses show that saddle points and multiple minima are very common when the $L_1$-norm criterion function is used. As a consequence, the $L_1$-norm based Cochrane-Orcutt procedure has a tendency to converge to points that do not define the global minimum of the criterion function. On the other hand, extensions of the Gauss-Newton technique to the $L_1$-norm case, as suggested by Osborne and Watson (1971) and Anderson and Osborne (1977), have shown more satisfactory results when applied to models with serially dependent errors (Nyquist, 1992).

In this paper attention is restricted to the case with a first-order auto- 
regressive error process. The aim of the paper is to present and discuss 
the Lagrange multiplier test of the hypothesis of serially independent error 
terms. This test is particularly attractive from a computational viewpoint 
since it only requires estimation of the restricted model, which is in this 
case, a linear model. Computational difficulties caused by non-linearities 
are therefore avoided in this approach. Furthermore, the Lagrange multi- 
plier test has the advantage that it does not require estimation of the error 
density at zero. 


2 The model 


We consider the linear regression model with first-order auto-regressive errors

$$y_t = x_t'\beta + u_t, \quad t = 1, 2, \ldots, T, \qquad (1)$$

$$u_t = \phi u_{t-1} + v_t, \quad t = 1, 2, \ldots, T, \qquad (2)$$

where $y_t$ is a response variable, $x_t$ is a $p$-vector of known regressors, $\beta$ is a $p$-vector of unknown regression parameters, and $u_t$ is the error generated by the auto-regressive process with parameter $\phi$. The disturbances $v_1, v_2, \ldots, v_T$ are assumed to be independent and identically distributed. Let $X$ be the $T \times p$ matrix with $x_t'$ as rows. For the subsequent analysis we assume that

Assumption 1: There exists a positive definite matrix $D$ such that $\lim_{T \to \infty} T^{-1} X'X = D$.

Assumption 2: The common cumulative distribution function of the disturbances $v_t$, $F$, is differentiable at 0 with $F'(0) > 0$.

Assumption 3: The disturbances $v_t$ have a finite variance, $d = V(v_t)$.


Our aim is to test the null hypothesis

$$H_0: \phi = 0,$$

against the alternative

$$H_1: \phi \ne 0.$$


Lagging (1) by one period, multiplying by $\phi$, and subtracting from (1) yields

$$y_t - \phi y_{t-1} = (x_t - \phi x_{t-1})'\beta + v_t, \quad t = 2, 3, \ldots, T.$$

Conditioning on the first observation, the $L_1$-norm criterion function is defined as

$$S(\beta, \phi) = \sum_{t=2}^{T} \left| y_t - \phi y_{t-1} - (x_t - \phi x_{t-1})'\beta \right| = \sum_{t=2}^{T} \left| y_t - h_t(\beta, \phi) \right|, \qquad (3)$$

where

$$h_t(\beta, \phi) = \phi y_{t-1} + (x_t - \phi x_{t-1})'\beta. \qquad (4)$$

The $L_1$-norm estimator $(\hat\beta, \hat\phi)$ is defined as the minimizer of $S(\beta, \phi)$. The computational problems caused by the serial dependency among the error terms stem from the non-linearity of $h_t(\beta, \phi)$.
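
To make the criterion concrete, here is a minimal sketch that evaluates $S(\beta, \phi) = \sum_{t=2}^{T} |y_t - \phi y_{t-1} - (x_t - \phi x_{t-1})'\beta|$ for given parameter values. It only evaluates the function; it does not minimize it. The function name and the list-of-lists layout for the regressors are assumptions made for this illustration.

```python
def l1_criterion(y, x, beta, phi):
    """Evaluate S(beta, phi), conditioning on the first observation.

    y    : list of T responses
    x    : list of T regressor vectors (each a list of length p)
    beta : list of p regression coefficients
    phi  : auto-regressive parameter
    """
    total = 0.0
    for t in range(1, len(y)):
        # h_t(beta, phi) = phi*y_{t-1} + (x_t - phi*x_{t-1})' beta
        h_t = phi * y[t - 1] + sum(
            b * (x_now - phi * x_lag)
            for b, x_now, x_lag in zip(beta, x[t], x[t - 1])
        )
        total += abs(y[t] - h_t)
    return total
```

Because $h_t$ contains the product $\phi\beta$, the criterion is linear in $\beta$ for fixed $\phi$ and in $\phi$ for fixed $\beta$, but not jointly, which is the non-linearity referred to above.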

The Gauss-Newton approach to minimize $S(\beta, \phi)$ utilizes a first-order Taylor approximation of $h_t(\beta, \phi)$ evaluated at a previous estimate $(\beta^{(s)}, \phi^{(s)})$ of the parameters,

$$h_t^{(s)}(\beta, \phi) = h_t(\beta^{(s)}, \phi^{(s)}) + (\phi - \phi^{(s)})\, \hat u_{t-1}^{(s)} + (x_t - \phi^{(s)} x_{t-1})'(\beta - \beta^{(s)}), \qquad (5)$$
where
$$\hat u_t^{(s)} = y_t - x_t'\beta^{(s)}.$$

The minimizer of

$$S^{(s)}(\beta, \phi) = \sum_{t=2}^{T} \left| y_t - h_t^{(s)}(\beta, \phi) \right| \qquad (6)$$

now yields new estimates, $(\beta^{(s+1)}, \phi^{(s+1)})$. The function $h_t^{(s)}(\beta, \phi)$ is linear in the parameters, so that standard routines, such as those of Barrodale and Roberts (1974) and Armstrong et al. (1979), can be used for finding the minimum of (6). Conditions for the convergence of the iteration process are found in Osborne and Watson (1971) and Anderson and Osborne (1977).


3 The Lagrange multiplier test 


The Lagrange multiplier test is based on the gradient of the unrestricted estimation problem evaluated at the restricted estimate. Thus, under the null hypothesis that $\phi = 0$, the restricted model is the linear regression model ignoring dependent error terms. Denoting the restricted estimates by $\hat\beta^{(r)}$ and using (5) we obtain the model

$$y_t = x_t'\beta + \phi\, \hat u_{t-1}^{(r)} + v_t, \quad t = 2, 3, \ldots, T, \qquad (7)$$
where
$$\hat u_t^{(r)} = y_t - x_t'\hat\beta^{(r)}, \quad t = 2, 3, \ldots, T$$

are the residuals computed from the restricted model. We find that the one step Gauss-Newton estimate of the auto-regressive parameter $\phi$ appears as the regression parameter for the constructed variable $\hat u_{t-1}^{(r)}$ appearing in the linear model (7). The $L_1$-norm criterion function to be minimized is therefore

$$S_0(\beta, \phi) = \sum_{t=2}^{T} \left| y_t - x_t'\beta - \hat u_{t-1}^{(r)} \phi \right|.$$

There are at least three immediate implications. First, the significance of including the auto-regressive error process in the model can be analyzed graphically by an added variable plot. Secondly, the significance of the auto-regressive process can formally be tested using the Lagrange multiplier test, by testing the significance of the regression parameter $\phi$ of the linear regression model (7). Thirdly, the model (7) can be estimated and the gradient for the Lagrange multiplier test can be computed by using standard routines designed for linear models.

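The asymptotic distribution of $L_1$-norm estimators of parameters in linear regression models was derived by Bassett and Koenker (1978) and Amemiya (1982). An application of this theory to the $L_1$-norm estimator $(\tilde\beta, \tilde\phi)$ of the parameters of the linear model (7) yields that $\sqrt{T}\,(\tilde\beta - \beta, \tilde\phi - \phi)$ is asymptotically normally distributed with mean zero and covariance matrix

$$\omega^2 \begin{bmatrix} D^{-1} & 0 \\ 0 & d^{-1} \end{bmatrix}$$

where $\omega^2 = 1/(2f(0))^2$, provided $H_0$ is true.
Following Koenker and Bassett (1982) we define

$$W(\delta_1, \delta_2) = \sum_{t=2}^{T} \left| v_t - \left( x_t'\delta_1 + \hat u_{t-1}^{(r)} \delta_2 \right) / \sqrt{T-1} \right|,$$

so that $W\left( \sqrt{T-1}\,(\tilde\beta - \beta),\, \sqrt{T-1}\,(\tilde\phi - \phi) \right) = S_0(\tilde\beta, \tilde\phi)$. We further define the normalized gradient of $W$ as

$$g(\delta_1, \delta_2) = \frac{1}{\sqrt{T-1}} \sum_{t=2}^{T} \begin{pmatrix} x_t \\ \hat u_{t-1}^{(r)} \end{pmatrix} \operatorname{sign}\left( v_t - \left( x_t'\delta_1 + \hat u_{t-1}^{(r)} \delta_2 \right) / \sqrt{T-1} \right).$$
Evaluating $g$ at $(\delta_1, \delta_2) = \left( \sqrt{T-1}\,(\hat\beta^{(r)} - \beta),\, -\sqrt{T-1}\,\phi \right)$ yields the vector $g(\hat\delta_1, \hat\delta_2) = (\hat g_1', \hat g_2)'$, where

$$\hat g_2 = \frac{1}{\sqrt{T-1}} \sum_{t=2}^{T} \hat u_{t-1}^{(r)} \operatorname{sign}\left( \hat u_t^{(r)} \right).$$

If $\hat g_2$ is large then $H_0$ is implausible. The Lagrange multiplier test statistic is now defined as the quadratic form of the gradient

$$\xi = \frac{\left[ \sum_{t=2}^{T} \hat u_{t-1}^{(r)} \operatorname{sign}\left( \hat u_t^{(r)} \right) \right]^2}{(T-1)\, d}.$$

From Koenker and Bassett (1982) it follows that the distribution of $\xi$ is approximately non-central $\chi^2$ with one degree of freedom and non-centrality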
parameter (T — 1) Sp. In particular, when Ho is true the asymptotic dis- 
tribution of £ is the central x? distribution with one degree of freedom. 
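
A minimal sketch of the test statistic follows, assuming it takes the form $\xi = \left[\sum_{t=2}^{T} \hat u_{t-1}^{(r)} \operatorname{sign}(\hat u_t^{(r)})\right]^2 / ((T-1)d)$, with the restricted $L_1$ residuals and the variance $d$ supplied by the caller. The function names are hypothetical, and fitting the restricted model is outside the scope of the sketch.

```python
def sign(value):
    """Return 1.0, -1.0 or 0.0 according to the sign of value."""
    if value > 0:
        return 1.0
    if value < 0:
        return -1.0
    return 0.0


def lm_statistic(residuals, d):
    """Lagrange multiplier statistic from restricted-model residuals.

    residuals : list of residuals u_1, ..., u_T from the restricted fit
    d         : variance of the disturbances (or an estimate of it)
    """
    T = len(residuals)
    # Score term: sum over t = 2..T of u_{t-1} * sign(u_t)
    score = sum(residuals[t - 1] * sign(residuals[t]) for t in range(1, T))
    return score ** 2 / ((T - 1) * d)
```

Under the null hypothesis the value would be compared with a quantile of the central chi-square distribution with one degree of freedom, for example 3.84 at the 5% level.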

A comparison of the powers of the $L_1$-norm based Lagrange multiplier test and the corresponding test based on least squares can be done in terms of asymptotic relative efficiency. This quantity is the ratio of the noncentrality parameters of the limiting distributions, $\mathrm{ARE} = d/\omega^2$. This comparison is therefore similar to a comparison of estimator efficiency. For normally distributed errors $\mathrm{ARE} = 2/\pi \approx 0.64$ so that the least squares method is preferable. However, as the error term variance increases and $\omega^2$ decreases, the $L_1$-norm based methods become more favorable in terms of ARE. At the Laplace distribution, for example, $\mathrm{ARE} = 2$.
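
The two ARE values quoted above follow directly from $\mathrm{ARE} = d/\omega^2$ with $\omega^2 = 1/(2f(0))^2$. The short sketch below reproduces them; the function names and the parametrisations (standard deviation `sigma` for the Normal, scale `b` for the Laplace) are choices made for this illustration.

```python
import math


def are_l1_vs_ls(variance, density_at_zero):
    """ARE = d / omega^2 with omega^2 = 1 / (2 f(0))^2."""
    omega_sq = 1.0 / (2.0 * density_at_zero) ** 2
    return variance / omega_sq


def are_normal(sigma=1.0):
    """Normal errors: d = sigma^2 and f(0) = 1 / (sigma * sqrt(2 pi))."""
    return are_l1_vs_ls(sigma ** 2, 1.0 / (sigma * math.sqrt(2.0 * math.pi)))


def are_laplace(b=1.0):
    """Laplace errors with scale b: d = 2 b^2 and f(0) = 1 / (2 b)."""
    return are_l1_vs_ls(2.0 * b ** 2, 1.0 / (2.0 * b))
```

In both families the scale parameter cancels, so the ARE does not depend on `sigma` or `b`.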


4 Final remarks 


This paper describes a Lagrange multiplier test for testing for a first-order auto-regressive error process in a linear regression model. There are several immediate extensions of the procedure. First, since the variance $d$ of the disturbances $v_t$ is unknown in most applications, it needs to be replaced by an estimator in the definition of $\xi$. A possible estimator is obtained by first estimating $\phi$ by minimizing $S_0$ and then defining


e 1 Gem _ zn) 
Secondly, Assumption 1 can be weakened to

Assumption 1': $\max_{1 \le t \le T} x_t'(X'X)^{-1} x_t \to 0$, as $T \to \infty$.

This would change the normalization of $\hat\beta - \beta$ from $\sqrt{T-1}$ to $(X'X)^{1/2}$. Thirdly, a test for a $p$-th order auto-regressive error process

$$u_t = \phi_1 u_{t-1} + \phi_2 u_{t-2} + \ldots + \phi_p u_{t-p} + v_t, \quad t = 1, 2, \ldots, T,$$

is obtained if (1) is lagged by $1, 2, \ldots, p$ periods, multiplying the equations by $\phi_1, \phi_2, \ldots, \phi_p$, respectively, and subtracting from (1) to obtain

$$y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \ldots - \phi_p y_{t-p} = (x_t - \phi_1 x_{t-1} - \phi_2 x_{t-2} - \ldots - \phi_p x_{t-p})'\beta + v_t, \quad t = p+1, p+2, \ldots, T.$$

The $L_1$-norm criterion function, given the first $p$ observations, is now

$$S(\beta, \phi) = \sum_{t=p+1}^{T} \left| y_t - h_t(\beta, \phi) \right|$$
with

$$h_t(\beta, \phi) = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + (x_t - \phi_1 x_{t-1} - \phi_2 x_{t-2} - \ldots - \phi_p x_{t-p})'\beta.$$
The restricted model under the hypothesis that $\phi_1 = \phi_2 = \ldots = \phi_p = 0$ becomes

$$y_t = x_t'\beta + \phi_1 \hat u_{t-1}^{(r)} + \phi_2 \hat u_{t-2}^{(r)} + \ldots + \phi_p \hat u_{t-p}^{(r)} + v_t, \quad t = p+1, p+2, \ldots, T$$

and the Lagrange multiplier test of the hypothesis of no auto-regressive errors of order less than or equal to $p$ is equivalent to the Lagrange multiplier test of the hypothesis that the regression coefficients $\phi_1, \phi_2, \ldots, \phi_p$ in the linear regression model are equal to zero.

Finally, it is worth pointing out that the approach for testing for autocorrelated error terms may be extended, under fairly mild regularity conditions, to the more general case of M-estimation of linear models.
Abstract: In this paper, we test the Gaussianity and nonlinearity of foreign exchange rate return series using the Gaussianity test due to Kariya, Tsay, Terui and Li (1994) and five well-known nonlinearity tests for stationary time series. The daily returns of the foreign exchange rates we consider exhibit strong non-Gaussianity or nonlinearity, but a central limit effect is observed as the observation interval becomes longer, even under stringent tests.

1 Introduction


In economic time series analysis, Gaussianity is often assumed for modeling and for constructing asymptotic tests such as unit root tests and cointegration tests, or for tabulating their critical points. In particular, in finance it is quite common to develop models and theories under the assumption of Brownianity or Gaussianity and apply them to real data. For example, the Black-Scholes stock option theory assumes that log prices follow a Brownian process or equivalently that returns of a stock follow a Gaussian process, and the capital asset pricing model (CAPM) assumes normality for returns, at least in their original forms. Although these theories have been developed in a less restrictive way for normality, most empirical modelling and many time series tests frequently used for financial series

338 Nobuhiko Terui and Takeaki Kariya 


are constructed under Gaussianity. Therefore it is quite important to check whether our financial data are consistent with the assumption.

In this paper, applying the Gaussianity test proposed by Kariya, Tsay, Terui and Li (1994), shortened as the KTTL test¹ below, we test the Gaussianity of five foreign exchange rate return series and, considering the close relationship between Gaussianity and linearity, we apply some nonlinearity tests to these data sets. The KTTL test of Gaussianity is a test that checks the consistency of the moment structure of a series with that of a Gaussian process up to an arbitrary order of the moments. In fact, Gaussianity, or equivalently normality, is completely characterized by the moment structure (see Billingsley, 1986). It is also noted that the KTTL test is consistent with any stationary time series structure, and that such tests as the skewness test and the kurtosis test are not only partial tests but also assume i.i.d.-ness (independently-and-identically-distributed-ness). On the other hand, most nonlinearity tests for stationary time series developed so far employ a linear process as the null hypothesis and set up, as alternatives, specific nonlinear models with additive noise. In fact, although there is a gap between non-Gaussianity and nonlinearity in stationary time series structure, a time series induced from a specific nonlinear model is almost always non-Gaussian as long as we assume Gaussianity of the additive noise of the nonlinear model, an assumption that is common in practical testing procedures. In this sense, nonlinearity tests constitute a kind of Gaussianity test.

We describe the KTTL test in Section 2. In Section 3, some nonlinearity testing methods for stationary time series are explained. Our empirical observations on the Gaussianity and nonlinearity of the returns of foreign exchange rates are provided in Section 4. We observe the following.

The daily series of foreign exchange rate returns show strong non-Gaussianity by the KTTL tests and nonlinearity by the five nonlinearity tests. The analysis of different observational intervals shows the operation of a central limit effect in the sense that the p-values get larger as the observational period becomes longer for most of the series. The KTTL omnibus test shows that the 6th order moment structures of the monthly series are not consistent with those of a Gaussian process, although the moment structures up to the 4th order are not always incompatible with those of Gaussianity; most previous tests, including the Jarque-Bera test, take only moments up to the 4th order into consideration.


¹ FORTRAN code for the KTTL test is available on request via e-mail: terui@econ.tohoku.ac.jp.

2 The KTTL test for Gaussianity


In this section, we describe the KTTL test of Gaussianity. To do so, we first summarize the KTTL test of multinormality, from which the KTTL test of Gaussianity follows.

Let $Y = (y_1, y_2, \ldots, y_n)'$ follow a multivariate normal distribution with mean vector $\mu = (\mu_1, \mu_2, \ldots, \mu_n)'$ and covariance matrix $\Sigma = (\sigma_{ij})$, and denote the standardized variate of $y_i$ as $z_i = (y_i - \mu_i)/\sqrt{\sigma_{ii}}$. Define $w_j^{(p)} = h^{(p)}(z_j)$ as the $p$-th order Hermite polynomial of $z_j$ for $p = 1, \ldots, P$ and $j = 1, \ldots, n$. For example, $w_j^{(1)} = z_j$, $w_j^{(2)} = (z_j^2 - 1)/\sqrt{2}$, $w_j^{(3)} = (z_j^3 - 3z_j)/\sqrt{6}$, $w_j^{(4)} = (z_j^4 - 6z_j^2 + 3)/\sqrt{24}$ and so on.
Let
$$\psi_{ij}^{(p,q)} = \mathrm{Cov}(w_i^{(p)}, w_j^{(q)}). \qquad (1)$$

Then, under the null hypothesis of multinormality of $Y$, it holds that for every $i, j = 1, \ldots, n$ and $p, q = 1, \ldots, P$, $E(w_i^{(p)}) = 0$ and $\psi_{ij}^{(p,q)} = \phi_{ij}^{(p,q)}$ with

$$\phi_{ij}^{(p,q)} = \begin{cases} \phi_{ij}^{\,p}, & p = q \\ 0, & p \ne q, \end{cases} \qquad (2)$$

where $\phi_{ij} = \mathrm{Cov}(w_i^{(1)}, w_j^{(1)})$. This characterization is shown in Kendall and Stuart (1964, p. 600). Granger and Newbold (1986, p. 308) discussed its application to time series as a technique of instantaneous transformation.
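
The normalized Hermite polynomials listed above can be generated with the standard three-term recurrence $He_{k+1}(z) = z\,He_k(z) - k\,He_{k-1}(z)$ followed by division by $\sqrt{p!}$. The sketch below is an illustration only; the function name is hypothetical and the paper's own FORTRAN code may be organised differently.

```python
import math


def hermite_normalized(p, z):
    """p-th order normalized Hermite polynomial h^(p)(z) = He_p(z) / sqrt(p!).

    Uses the recurrence He_{k+1}(z) = z*He_k(z) - k*He_{k-1}(z),
    starting from He_0(z) = 1 and He_1(z) = z.
    """
    if p == 0:
        return 1.0
    previous, current = 1.0, z
    for k in range(1, p):
        previous, current = current, z * current - k * previous
    return current / math.sqrt(math.factorial(p))
```

With this normalization each transformed variate has unit variance when $z$ is standard Normal, which is what makes the covariance in (2) a simple power of the correlation.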
Let $Y_t$, $t = 1, \ldots, T$, be the vectors of observations of $Y$ and define the sample version of $z_i$ as $u_{it} = (y_{it} - \bar y_{i\cdot})/\sqrt{s_{ii}}$, where $\bar y_{i\cdot} = \frac{1}{T} \sum_{t=1}^{T} y_{it}$ and $s_{ij} = \frac{1}{T} \sum_{t=1}^{T} (y_{it} - \bar y_{i\cdot})(y_{jt} - \bar y_{j\cdot})$. Denote the corresponding transformed variates of $u_{it}$ via Hermite polynomials as $u_{it}^{(p)} = h^{(p)}(u_{it})$ for $i = 1, \ldots, n$ and $p = 1, \ldots, P$. Under the assumption of the existence of the $2P$-th order moments of $y_1, y_2, \ldots, y_n$, where $P$ is the maximum order of the Hermite transformation under consideration and fixed in advance, two kinds of estimator for the covariance between $w_i^{(p)}$ and $w_j^{(q)}$ are used to construct an asymptotic test of multinormality. One is a consistent estimator only under the null hypothesis of multinormality, and the other is a consistent estimator under any arbitrary distribution. The former is given by

$$\hat\phi_{ij}^{(p,q)} = \begin{cases} \hat\phi_{ij}^{\,p}, & p = q \\ 0, & p \ne q, \end{cases} \qquad (3)$$

where $\hat\phi_{ij}$ is the sample correlation coefficient between $z_i = w_i^{(1)}$ and $z_j = w_j^{(1)}$, and the latter is the conventional sample covariance estimate

$$r_{ij}^{(p,q)} = \frac{1}{T} \sum_{t=1}^{T} u_{it}^{(p)} u_{jt}^{(q)}. \qquad (4)$$

The proposed test employs the difference between (3) and (4) as a test statistic.
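Define a $Pn \times Pn$ symmetric matrix $R$ by

$$R = \begin{bmatrix} R^{(1,1)} & R^{(1,2)} & \cdots & R^{(1,P)} \\ R^{(2,1)} & R^{(2,2)} & \cdots & R^{(2,P)} \\ \vdots & \vdots & & \vdots \\ R^{(P,1)} & R^{(P,2)} & \cdots & R^{(P,P)} \end{bmatrix} \qquad (5)$$
where $R^{(p,q)}$ is an $n \times n$ matrix whose $(i,j)$ element is $r_{ij}^{(p,q)}$. Now compose the $Pn(Pn+1)/2$ dimensional vector $r_P$ as follows:
$$r^{(p,p)} = \mathrm{Vech}(R^{(p,p)}): \quad n(n+1)/2 \times 1 \qquad (6)$$
$$r^{(p,q)} = \mathrm{Vec}(R^{(p,q)}): \quad n^2 \times 1 \qquad (7)$$
$$r_{P1} = (r^{(1,1)\prime}, r^{(2,2)\prime}, \ldots, r^{(P,P)\prime})': \quad f_1 = Pn(n+1)/2 \times 1 \qquad (8)$$
$$r_{P2} = (r^{(1,2)\prime}, \ldots, r^{(1,P)\prime}, r^{(2,3)\prime}, \ldots, r^{(P-1,P)\prime})': \quad f_2 = P(P-1)n^2/2 \times 1 \qquad (9)$$
$$r_P = (r_{P1}', r_{P2}')': \quad f = f_1 + f_2 = Pn(Pn+1)/2 \times 1 \qquad (10)$$
Here for any $n \times n$ symmetric matrix $A = (a_{ij})$,
$$\mathrm{Vech}(A) = (a_{11}, a_{12}, \ldots, a_{1n};\ a_{22}, a_{23}, \ldots, a_{2n};\ \ldots;\ a_{nn})$$
and for any $n \times n$ matrix $B = (b_{ij})$,
$$\mathrm{Vec}(B) = (b_{11}, b_{12}, \ldots, b_{1n};\ b_{21}, b_{22}, \ldots, b_{2n};\ \ldots;\ b_{n1}, b_{n2}, \ldots, b_{nn}).$$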
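
The two stacking operators can be sketched as follows, using the row-wise ordering given in the definitions above (many texts stack columns instead, so the ordering here is specific to this section). Matrices are represented as lists of rows; the function names are chosen for this illustration.

```python
def vech(a):
    """Stack the upper triangle of a symmetric matrix row by row:
    (a11, a12, ..., a1n; a22, ..., a2n; ...; ann)."""
    n = len(a)
    return [a[i][j] for i in range(n) for j in range(i, n)]


def vec(b):
    """Stack all elements of a square matrix row by row:
    (b11, ..., b1n; b21, ..., b2n; ...; bn1, ..., bnn)."""
    return [value for row in b for value in row]
```

For an $n \times n$ input, `vech` returns $n(n+1)/2$ elements and `vec` returns $n^2$, matching the dimensions stated in (6) and (7).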
Correspondently, by replacing re a) by we 1) we can derive an f x 1 pa- 
rameter vector wp under the alternative hypothesis. Also under the null 


hypothesis, in the same way we define, by replacing rep 9) by ge 3) and 


oe 9) we can define 


bp = CIR o'y and Pp = ($p O'Y, (11) 
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which correspond to rp and wp respectively. Then, assuming the 2Pth 
order moment of yit, rp follows the asymptotic multivariate normal distri- 
bution, 


VT(rp—wp)~N;(O,A) (A is the function of wp). (12) 
Similarly it follows from (3) under the null hypothesis that 
VT (bp — bp) ~ N;(O,A*) (A* is a function of oo). (13) 


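The KTTL test detects the equivalence between $\omega_P$ and $\phi_P$ under the null hypothesis based on the difference of $r_P$ and $\hat{\phi}_P = (\hat{\phi}_{P1}', O')'$. As we have $n(n+1)/2$ equivalent relationships, $\hat{g}_{ij}^{(1,1)} = r_{ij}^{(1,1)}$, the test is based on the $g = \{f - n(n+1)/2\}$ dimensional vector

$$C_{P0} = C_{P0}(r_P) = (C_{P1}', C_{P2}')', \qquad (14)$$

where

$$C_{P1} = (c^{(2,2)\prime}, \ldots, c^{(P,P)\prime})', \qquad (15)$$
$$C_{P2} = (c^{(1,2)\prime}, \ldots, c^{(1,P)\prime}, \ldots, c^{(P-1,P)\prime})' \quad \text{and} \qquad (16)$$
$$c^{(p,q)} = \begin{cases} r^{(p,p)} - \hat{g}^{(p,p)}, & p = q, \\ r^{(p,q)}, & p \neq q. \end{cases} \qquad (17)$$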


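In fact, $C_{P0}$ is expressed as

$$C_{P0} = [O \;\; I](r_P - \hat{\phi}_P), \qquad (18)$$

where $O$ is the $(f - n(n+1)/2) \times n(n+1)/2$ zero matrix, and $I$ is the $(f - n(n+1)/2)$ dimensional identity matrix. Under the null hypothesis, where $\omega_P = \phi_P$, it holds from (12) and (13) that

$$\sqrt{T} C_{P0} \sim N_g(O, J(\phi_P) \Lambda J(\phi_P)'), \qquad (19)$$

where the $(f - n(n+1)/2) \times f$ matrix $J(\phi_P)$ is

$$J(\phi_P) = \left.\frac{\partial C_{P0}}{\partial r_P}\right|_{r_P = \phi_P} = \left.\left(\frac{\partial c^{(2,2)}}{\partial r_P}, \ldots, \frac{\partial c^{(P,P)}}{\partial r_P}, \frac{\partial c^{(1,2)}}{\partial r_P}, \ldots, \frac{\partial c^{(P-1,P)}}{\partial r_P}\right)'\right|_{r_P = \phi_P} \qquad (20)$$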
and the matrix differential is given by
-~ ( (ay | (21) 
p Or; 


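The KTTL test statistic for multinormality is a Wald type chi-square test defined by

$$W_0 = T C_{P0}' [\hat{J} \hat{\Lambda} \hat{J}']^{-1} C_{P0}, \qquad (22)$$

where $\hat{J} = J(\hat{\phi}_P)$ and $\hat{\Lambda} = (\hat{\lambda}_{ij,kl}^{(pq,ab)})$ with

$$\hat{\lambda}_{ij,kl}^{(pq,ab)} = \frac{1}{T} \sum_{t=1}^{T} (u_{it}^{(p)} u_{jt}^{(q)} - r_{ij}^{(p,q)})(u_{kt}^{(a)} u_{lt}^{(b)} - r_{kl}^{(a,b)}). \qquad (23)$$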
The KTTL test of Gaussianity for univariate series is a modification of the above test. Let $\{x_t\}$ be a univariate stationary process with $E(x_t) = \mu$ and $\mathrm{Cov}(x_t, x_{t-k}) = \gamma_k$, and assume the mixing condition $\sum_k |k||\gamma_k| < \infty$. Then the following two methods for constructing a test are proposed.
[I] Overlapping method

Set, for $i = 1, \ldots, n$,
$$y_{it} = x_{t-i+1}, \qquad (24)$$
and, corresponding to the i.i.d. case, we define $z_{it} = (y_{it} - \mu)/\sqrt{\gamma_0} = (x_{t-i+1} - \mu)/\sqrt{\gamma_0}$.

[II] Non-overlapping method

For a given positive integer $n$ and the realizations $\{x_1, \ldots, x_N\}$, set $n$-dimensional non-overlapping random vectors $\{y_t = (y_{1t}, \ldots, y_{nt})'\}$ with
$$y_{it} = x_{n(t-1)+i}, \quad i = 1, \ldots, n; \; t = 1, \ldots, T, \qquad (25)$$
where $T = \lfloor N/n \rfloor$ is the integer part of $N/n$. Then we define $z_{it} = (x_{n(t-1)+i} - \mu)/\sqrt{\gamma_0}$.
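The two constructions in (24) and (25) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and for the overlapping method the sketch keeps only those time points for which all $n$ lagged values are available, which is a simplification of the sums starting at $t = \max(i,j)$ used in the text.

```python
def overlapping_vectors(x, n):
    """Overlapping method (24): y_t = (x_t, x_{t-1}, ..., x_{t-n+1}).

    Only time points with a complete set of n lags are returned
    (an assumption of this sketch).
    """
    N = len(x)
    return [[x[s - k] for k in range(n)] for s in range(n - 1, N)]


def non_overlapping_vectors(x, n):
    """Non-overlapping method (25): y_t = (x_{n(t-1)+1}, ..., x_{nt}).

    T = floor(N / n) vectors are formed; leftover observations are dropped.
    """
    T = len(x) // n
    return [x[n * t: n * (t + 1)] for t in range(T)]


series = [1, 2, 3, 4, 5, 6, 7]
print(overlapping_vectors(series, 3))
print(non_overlapping_vectors(series, 3))
```

With seven observations and $n = 3$, the overlapping method yields five vectors while the non-overlapping method yields only two, which illustrates why the overlapping method is attractive for short series.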

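For both [I] and [II], it follows that $\mathrm{Cov}(z_{it}, z_{jt}) = \gamma_{i-j}/\gamma_0 = \rho_{i-j} = \rho_{ij}$. Defining $w_{it}^{(p)} = h^{(p)}(z_{it})$, just as in the i.i.d. case, under the Gaussianity of $x_t$ it holds that $\omega_{ij}^{(p,q)} = g_{ij}^{(p,q)}$ with

$$g_{ij}^{(p,q)} = \begin{cases} \rho_{i-j}^{p}, & p = q, \\ 0, & p \neq q. \end{cases} \qquad (26)$$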


The sample variates of $z_{it}$ and $w_{it}^{(p)}$ are defined as $u_{it} = (y_{it} - \bar{y}_i)/\sqrt{s_{ii}}$, $u_{it}^{(p)} = h^{(p)}(u_{it})$ for $i = 1, \ldots, n$ and $p = 1, \ldots, P$, where $\bar{y}_i = T^{-1} \sum_{t=1}^{T} y_{it}$ and $s_{ij} = \frac{1}{T} \sum_{t=\max(i,j)}^{T} (y_{it} - \bar{y}_i)(y_{jt} - \bar{y}_j)$. For time series data, the estimates corresponding to (3) and (4) should be respectively modified as

$$r_{ij}^{(p,q)} = \frac{1}{T} \sum_{t=\max(i,j)}^{T} u_{it}^{(p)} u_{jt}^{(q)} \qquad (27)$$

and

$$\hat{g}_{ij}^{(p,p)} = \hat{\rho}_{i-j}^{\,p} \qquad (28)$$


for $i, j = 1, \ldots, n$; $p, q = 1, \ldots, P$. Under the Gaussianity of $\{x_t\}$ with the mixing condition above, Keenan (1983) proved the asymptotic normality $\sqrt{T}(r_P - \hat{\phi}_P) \to N(O, J \Lambda J')$. Therefore, exactly following the argument of the i.i.d. case, we obtain the asymptotically $\chi^2$ test statistic (22) with d.f. $\{f - n(n+1)/2\}$ for testing the Gaussianity of a stationary time series.

The test in (22) is an omnibus test which detects departures from Gaussianity, and it can be decomposed into two parts. The first part tests departures from the even moment structure when $p = q$ and the second part tests departures from the odd moment structure when $p \neq q$. It is useful to have a separate test for each part when we are interested in the symmetry and the tail behavior of the underlying distribution separately. Corresponding to the dimensions of $C_{P1}$ and $C_{P2}$, the appropriate decomposition of $\Lambda$ and $J$ produces the following test statistics:


A A ^ —1 
Wi = TC'p, [Bnin] C pi, and (29) 


^ —1 
Wz = TC'pa [Â22| Co. (30) 


Under the null hypothesis of Gaussianity, the asymptotic distributions of these two test statistics are $\chi^2$ with degrees of freedom $f_1 - n(n+1)/2$ and $f_2$ respectively. In fact, setting $A = [J \Lambda J']^{-1}$, and decomposing $A$ appropriately according to the dimensions of $C_{P1}$ and $C_{P2}$, we have the following relationship among $W_0$, $W_1$ and $W_2$:

$$W_0 = T C_{P0}' (J \Lambda J')^{-1} C_{P0}$$
$$\phantom{W_0} = T (C_{P1}', C_{P2}') \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} (C_{P1}', C_{P2}')'$$
$$\phantom{W_0} = T C_{P1}' A_{11} C_{P1} + T C_{P2}' A_{22} C_{P2} + 2T C_{P1}' A_{12} C_{P2}$$
$$\phantom{W_0} = W_1 + W_2 + 2C_{12}, \qquad (31)$$

where $C_{12} = T C_{P1}' A_{12} C_{P2}$. Here $C_{P1}' A_{12} C_{P2}$ may be regarded as a kind of correlation between $C_{P1}$ and $C_{P2}$. In fact, in the case where both $C_{P1}$ and $C_{P2}$ are one dimensional vectors, $C_{12}/(\sqrt{W_1}\sqrt{W_2})$ is shown to be equivalent to the negative correlation between $C_1$ and $C_2$.


3 Some nonlinearity tests 


Let
$$X_t = h(X_{t-1}, X_{t-2}, \cdots, X_{t-p}) + e_t \qquad (32)$$

be an autoregressive nonlinear time series model, where $\{e_t\}$ is i.i.d. with mean zero. If we assume the innovation $e_t$ to be Gaussian, a nonlinearity test is equivalent to a Gaussianity test. Here we use five well-known nonlinearity tests:


(i) Ori-F test by Tsay (1986) 

(ii) Aug-F test by Luukkonen, Saikkonen and Teräsvirta (1988)
(iii) CUSUM test by Petruccelli and Davis (1986) 

(iv) TAR-F test by Tsay (1989) 

(v) New-F test by Tsay (1988) 


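All of these tests set up, as a null hypothesis, a linear process. Based on the Volterra expansion of (32) around $O = (0, 0, \cdots)'$,

$$X_t = \mu + \sum_{u=1}^{\infty} \phi_u X_{t-u} + \sum_{u,v=1}^{\infty} \phi_{uv} X_{t-u} X_{t-v} + \sum_{u,v,w=1}^{\infty} \phi_{uvw} X_{t-u} X_{t-v} X_{t-w} + \cdots + e_t, \qquad (33)$$

where

$$\mu = h(O), \quad \phi_u = \left.\frac{\partial h}{\partial X_{t-u}}\right|_O, \quad \phi_{uv} = \left.\frac{\partial^2 h}{\partial X_{t-u} \partial X_{t-v}}\right|_O, \quad \phi_{uvw} = \left.\frac{\partial^3 h}{\partial X_{t-u} \partial X_{t-v} \partial X_{t-w}}\right|_O, \quad \text{etc.},$$

the Ori-F and Aug-F tests detect against the nonlinearity of the second and third order polynomials respectively. The CUSUM, TAR-F and New-F tests assume the threshold type nonlinear alternatives;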


$$X_t = \phi_0^{(j)} + \sum_{i=1}^{p} \phi_i^{(j)} X_{t-i} + a_t^{(j)} \quad (j = 1, 2), \qquad (34)$$

where $\{a_t^{(j)}\}$ is the innovation of mean zero and variance $\sigma_j^2$. The New-F test covers the most extensive alternatives of nonlinearity, including ExpAR proposed by Haggan and Ozaki (1981)

$$X_t = \{\phi_1 + \pi_1 \exp(-\gamma X_{t-1}^2)\} X_{t-1} + \{\phi_2 + \pi_2 \exp(-\gamma X_{t-1}^2)\} X_{t-2} + \cdots + \{\phi_p + \pi_p \exp(-\gamma X_{t-1}^2)\} X_{t-p} + e_t \qquad (35)$$

and bilinear models by Granger and Anderson (1978), Subba Rao and Gabr (1984) and others

$$X_t + \sum_{j=1}^{p} a_j X_{t-j} = \sum_{j=0}^{r} c_j e_{t-j} + \sum_{i=1}^{m} \sum_{j=1}^{k} b_{ij} X_{t-i} e_{t-j}. \qquad (36)$$


The detailed procedures and distributional properties regarding these tests 
are found in Granger and Teräsvirta (1993). 
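To make the ExpAR alternative (35) concrete, the following sketch simulates such a series. It is an illustration only: the function name, the Gaussian innovations, the zero starting values and the burn-in length are assumptions of this sketch and are not taken from the papers cited above.

```python
import math
import random


def simulate_expar(phi, pi, gamma, n_obs, sigma=1.0, seed=0, burn_in=200):
    """Simulate X_t = sum_k {phi_k + pi_k exp(-gamma X_{t-1}^2)} X_{t-k} + e_t.

    phi and pi are lists of equal length p; e_t is drawn as N(0, sigma^2)
    (an assumption of this sketch). The first burn_in values are discarded.
    """
    rng = random.Random(seed)
    p = len(phi)
    x = [0.0] * p
    for _ in range(n_obs + burn_in):
        damp = math.exp(-gamma * x[-1] ** 2)        # depends on X_{t-1} only
        mean = sum((phi[k] + pi[k] * damp) * x[-1 - k] for k in range(p))
        x.append(mean + rng.gauss(0.0, sigma))
    return x[-n_obs:]


path = simulate_expar(phi=[0.3], pi=[0.5], gamma=1.0, n_obs=5, seed=1)
print(len(path))
```

When all $\pi_k$ are zero the recursion reduces to a linear AR($p$) model, which is the null hypothesis shared by the tests listed above.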

In order to implement these tests, the order $p$ of autoregression for all the tests and the value of the delay parameter $d$ for the tests (iii), (iv), and (v) need to be specified. We set the maximum of $p$ to 10 and let $d$ run from 1 to 10. Each nonlinearity test gives different results for different choices of $(p, d)$. We employ the most significant result of each test among all the combinations of $(p, d)$.


4 Foreign exchange rate 


Since the work of Westerfield (1977), a large body of research on foreign exchange rates has been produced. For a survey, see Levich (1985), Isard (1988), Mills (1993) and Campbell, Lo and MacKinlay (1997). Many of those works, for example So (1987), Wolff (1987), Enders (1988) and others, employ the Gaussianity of the process as a basic assumption. This assumption should be investigated carefully in advance.

In this section, we investigate the Gaussianity and nonlinearity of five foreign exchange rate return series against the US dollar: FRF (French Franc), JPY (Japanese Yen), CHF (Swiss Franc), GBR (British Pound Sterling), DEM (German Mark). We deal with three kinds of data: daily (1992.1.1 - 1993.12.31 (552 samples)), weekly (1984.1.2 - 1993.12.27 (521 samples: Mondays)) and monthly (1983.10.31 - 1994.10.31 (132 samples)). Over the observational period, we assume homogeneity of these data.

We define an exchange rate return $x_t$ at time $t$ as $x_t = \log R_t - \log R_{t-1}$, where $R_t$ is the exchange rate at $t$.
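The return definition can be written in a few lines. The function name and the sample rates below are hypothetical and serve only to illustrate the formula.

```python
import math


def log_returns(rates):
    """Return x_t = log R_t - log R_{t-1} for consecutive exchange rates."""
    return [math.log(b) - math.log(a) for a, b in zip(rates, rates[1:])]


rates = [100.0, 110.0, 99.0]   # made-up rates, for illustration only
x = log_returns(rates)
print(len(x))
```

A series of $N + 1$ rates gives $N$ returns, and the returns sum to $\log R_N - \log R_0$.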


4.1 Gaussianity 


In this article, we use the overlapping method of the KTTL test because macroeconomic time series usually do not have enough observations. Denote the KTTL test with $P = 2$ and $n = 1$ by O-21 and so on.


346 Nobuhiko Terui and Takeaki Kariya 


Applying the KTTL test to the data, we vary $P = 2, 3$ and $n = 1, \ldots, 4$, and the p-values of the tests O-21 through O-32 are tabulated in Table 1. The other tests with $P = 3$ and $n > 2$ (that is, O-33 and O-34) reject the Gaussianity very strongly. In the following, we enumerate the empirical findings from the results.


1) Daily series 


a) All the tests applied to series other than CHF reject the Gaussianity. The marginal moment test $W_1$ rejects the Gaussianity at the 5% significance level for all the series. In fact, the maximum value of the p-values is 0.03955 for CHF.

b) In O-21 and O-22 tests, the results of the $W_2$ test are not significant for many cases.

c) The $W_0$ test cannot reject the Gaussianity of CHF even when $P$ and $n$ are large.


2) Weekly series 


a) The Gaussianity of JPY is rejected more strongly than other series. 
b) Compared with daily series, the number of significant series increases. 


3) Monthly series 


a) In the case of $P = 2$, not only the omnibus test $W_0$ but also the marginal tests $W_1$ and $W_2$ cannot reject the Gaussianity for all the series.

b) Compared with daily and weekly series, many p-values of all the tests, $W_0$, $W_1$ and $W_2$, are larger.

c) The $W_1$ and $W_2$ tests reject the Gaussianity of GBR strongly; however, the $W_0$ test is not necessarily inconsistent with the Gaussianity (O-22, O-23 and O-31).

d) The $W_1$ test rejects the Gaussianity more strongly when we set $P = 3$. This means that the 6th order moment structures of the data sets are not consistent with those of a Gaussian process, although the moment structures up to the 4th order are not always incompatible with those of Gaussianity.

e) In the case of $P = 3$, we have some cases where the omnibus test $W_0$ rejects the null hypothesis more strongly than the marginal tests $W_1$ and $W_2$ (JPY, CHF, DEM of O-31 and JPY, CHF of O-32). This means that $2C_{12}$ of (31) takes large positive values for those cases.


As a whole, we observe that, given $P$, the hypothesis of Gaussianity tends to be rejected more strongly as $n$ gets larger and in the case of $P = 3$, and the Gaussianity is completely rejected for almost all returns, especially in the case of $n > 2$. Further we observe that a central limit effect is working on weekly series in the sense that the p-values get larger as the observational period becomes longer.


Table 1: The KTTL Tests: Foreign Exchange Rate Returns. 


Daily(552 samples) Weekly(521 samples) Monthly(132 samples) 
Variable P{W 
O-21(D.F.: Wo : 2, W1 : 1, W2 : 1) 
57667 


FRF 0. 0.11414 ‘ 0.35430 0.47818 0.28592 0.09538 
JPY ; 0.35201 0.00674 0.00112 0.05900 0.66199 0.69024 0.49823 
CHF 0.14530 0.03956 0.84395 0.42304 0.23533 0.72896 0.72221 0.79985 0.37821 


0.15180 0.26114 0.00262 0.03956 
0.73782 0.22792 
O-22(D.F.:Wo : 7, W1 : 3, Wo: 4) 
0.05483 i 
0.00305 0.00000 0.06881 
0.63944 0.16946 0.43741 
0.35538 0.00002 0.02632 
0.07231 0.96215 
0-23(D.F.:W, : 15, W1 : 6, Wa : 9) 
0.02444 0.00000 0.29071 
0.00126 0.00000 0.08059 
0.71675 0.16591 0.52572 
0.47406 0.00000 0.00168 
0.04213 0.00118 0.99706 
O-24(D.F.:Wo : 26, Wy : 10, W2 : 16) 
0.00261 
0.00016 0.00000 0.01655 
0.48250 0.01646 0.60762 
0.48250 0.01646 0.60762 
0.01875 
O-31(D.F.:Wo : 5, Wy : 2, Wo : 3) 
0.30700 : 
0.00002 0.00000 0.00000 
0.86862 0.04989 0.34839 
0.00573 0.00000 0.00000 
0.35201 


0.26506 0.11417 0.77253 
0.55148 0.53422 0.16216 


0.00076 a 
0.01096 0.00000 0.01128 
0.20441 0.01291 0.36325 
0.00083 0.00000 0.00000 
0.02160 0.00128 


0.75493 0.41573 0.06835 
0.87798 0.88747 0.81214 
0.94473 0.75097 0.73328 
0.53963 0.00000 0.00006 
0.78226 0.78435 0.17848 


0.00037 0.00000 
0.01999 0.00000 0.00439 
0.28879 0.00219 0.09646 
0.00019 0.00000 0.00000 
0.03287 0.00014 


0.84104 0.22524 0.00350 
0.85650 0.25579 0.83568 
0.98131 0.78192 0.81381 
0.73477 0.00000 0.00000 
0.81120 0.89435 0.08016 


0.00026 0.00000 
0.00819 0.00000 0.00005 
0.43501 0.00029 0.03210 
0.00002 0.00000 0.00000 
0.02310 0.00000 


0.72710 0.07933 0.00000 
0.87774 0.09242 0.75604 
0.87173 0.83366 0.48292 
0.17422 0.00000 0.00000 
0.67201 0.87080 0.01459 


0.00000 
0.00000 0.00000 0.00000 
0.09266 0.00000 0.00000 
0.00004 0.00000 0.00000 
0.00000 


0.0 0000 0.33776 0.00000 
0.00 012 0.07334 0.03814 
0.0 6818 0.18915 0.38700 
0.2 5814 0.00000 0.00001 
0.0 0000 0.61632 0.00000 


0.00000 0.22459 ; 3 0.00000 0.01589 0.00000 
0.00000 0.00000 0.00000 


JPY 0.00000 0.00000 0.00000 0.00000 0.00266 0.00000 
CHF 0.13266 0.00000 0.00000 0.45164 0.00005 0.03728 0.00217 0.15251 0.13535 
GBR 0.00000 0.00000 0.00000 0.00009 0.00000 0.00000 0.00000 0.00000 0.00000 


0.00000 0.51822 
P(W;) means the p-value of the test W;, i = 0,1, 2. 


0.00000 0.18092 0.00000 


4.2 Nonlinearity 


Table 2 shows the results of applying the five nonlinearity tests explained in Section 3 to the foreign exchange returns. The CUSUM test produces very different results from the other tests. In Terui and Kariya (1996), it is shown that, when these tests are applied to 214 Japanese stock returns, the empirical distribution of the p-values of the CUSUM test has a very different shape from those of the four other tests. Further, the simulation studies by Tsay (1988, 1989) indicate the low power of the CUSUM test, and the present results can be seen as empirical evidence of this. Therefore we leave out the results of the CUSUM test.
From the table, we observe:


1) Daily series 


a) FRF and GBR are non-Gaussian judging from all the nonlinearity tests.

b) JPY is compatible with a Gaussian process for all the tests. It is possible for JPY to have a nonlinearity other than what the five tests assume as alternative, because the KTTL test also supports the non-Gaussianity.

c) For CHF, the Ori-F and Aug-F tests support the Gaussianity and the TAR-F and New-F tests suggest the non-Gaussianity. Therefore CHF can have a threshold type nonlinearity.

d) DEM is compatible with a Gaussian process by the Aug-F test and inconsistent by the TAR-F and New-F tests. The threshold type nonlinearity might exist for DEM.


2) Weekly series 


a) FRF, CHF and DEM could have threshold type nonlinearity because 
the TAR-F and New-F tests reject the null hypothesis of linearity. 

b) The non-Gaussianity is suggested for JPY by all the tests. 

c) GBR is compatible with a Gaussian process for all the tests. 


3) Monthly series 


a) There are no returns with the p-values greater than 1%, except for 


the TAR-F test applied to JPY. 


Table 2: Nonlinearity Tests: Foreign Exchange Rate Returns. 


Variable New-F 
Daily 
0.00000 | 0.00110 
0.24706 | 0.25270 
0.68311 | 0.02595 
0.00003 | 0.00262 
0.08136 | 0.00511 
Weekly 
0.06449 | 0.00341 


0.00000 
0.14047 
0.46322 
0.00012 
0.03721 


0.10151 
0.74493 
0.04235 
0.00920 
0.02198 


0.00053 
0.05492 
0.02791 
0.00241 
0.02273 


JPY 
CHF 


0.08246 


0.18164 | 0.05208 


JPY 0.01152 0.11705 | 0.00000 
CHF 0.46697 | 0.68680 | 0.02721 | 0.04223 | 0.01986 
GBR | 0.06608 | 0.14518 | 0.11549 | 0.33748 | 0.07851 


0.06415 | 0.05298 | 0.00125 
Monthly 


0.04798 | 0.11767 


0.11915 | 0.00035 


0.07120 


0.04712 | 0.10081 


JPY 0.04101 | 0.01300 | 0.00159 | 0.12351 | 0.03721 
CHF 0.03099 | 0.07967 | 0.10196 | 0.15927 | 0.04794 
GBR 0.17833 | 0.22842 | 0.04982 | 0.13293 | 0.02420 


0.10053 | 0.12918 | 0.15184 | 0.06969 | 0.11015 
Abstract: The bivariate ranks and quantiles based on the bivariate affine
equivariant median are considered. Correspondences between two differ- 
ent plots for bivariate data, the direct diagram and the Oja rank plot, 
are described. Several illustrative examples are given. 


Key words: Affine invariance, affine equivariance, bivariate quantile, bi- 
variate rank, multivariate median. 


AMS subject classification: 62G30, 62F07, 62H99. 


1 Introduction 


Rank methods occupy a central role among standard univariate statistical 
methods, and form the backbone of conventional nonparametrics. Conse- 
quently, it has been recently of some interest to explore concepts of rank for 
multivariate data, and in particular, for bivariate data. There are various 
alternatives, including ideas based on depth (Liu, 1990,1992; Liu and Singh, 
1993). But another analytic definition of bivariate rank which leads to ap- 
pealing bivariate analogues of univariate rank statistical methods is reached 
through the gradients of a convex objective function used to define a bivari- 
ate median; see Brown and Hettmansperger (1987a,b), Hettmansperger, 
Nyblom and Oja (1992), Hettmansperger, Möttönen and Oja (1997a,b) 
and Möttönen and Oja (1995). To show how this idea works, the notion of 


352 B. M. Brown, T. P. Hettmansperger, J. Möttönen and H. Oja 


univariate rank is set up this way in Section 2. 

There are several possible definitions of bivariate median which could be 
used to develop a notion of bivariate rank (Small, 1990; Niinimaa and Oja, 
1997). Among these, the bivariate median of Oja (1983) is affine invariant. 
The resulting bivariate ranks are called Oja ranks, and defined in Section 3. 
They lead to the idea of bivariate quantile, which is a data item or region 
or chord between data items having prescribed rank. 

The purpose of this paper is to examine the connections between and 
uses of two corresponding plots. The first, called the direct diagram, 
is just a plot of data in $\mathbb{R}^2$ (the observation points with the lines going
through pairs of observations). The second diagram, called an Oja rank 
plot, describes data items and regions having particular Oja rank values. 
From it, quantiles in the Oja sense can be read off. The Oja rank plot 
is developed in Section 4. In certain senses, the two plots have a duality 
relationship. Such connections, and other properties, are listed in Section 4. 
The correspondences between the two plots indicate considerable potential 
for higher dimensional versions to be useful in informal data analysis. 


2 Univariate ranks 


Given univariate data $x_1, \ldots, x_n$, the median $m$ is defined as the choice $\theta = m$ to minimize the dispersion criterion

$$S(\theta) = \sum_{i=1}^{n} |x_i - \theta|.$$

Clearly $S'(\theta) = -\sum \mathrm{sgn}(x_i - \theta)$, and this gradient function can be used to define univariate rank. In order to facilitate a clear analog with the coming bivariate case it is convenient to define $\mathrm{sgn}(t) = +1$ if $t > 0$, $-1$ if $t < 0$, but $\mathrm{sgn}(t)$ can take any value in $[-1, +1]$ when $t = 0$. If $x_{(j)}$ denotes the $j$th order statistic, then $\frac{1}{2} S'(x_{(j)})$ is any value in $[j - 1 - \frac{n}{2}, j - \frac{n}{2}]$ while in general, if $x_{(j-1)} < \theta < x_{(j)}$, then

$$\frac{1}{2} S'(\theta) = j - 1 - \frac{n}{2}.$$

The rank of the position $\theta$ among $\{x_i\}$ can therefore be defined as

$$R(\theta) = \frac{1}{2} S'(\theta),$$

and $-\frac{1}{2}n \le R(\theta) \le \frac{1}{2}n$, with $x_{(0)} = -\infty$ and $x_{(n+1)} = +\infty$.

Inversion of the rank function leads to the notion of quantile. For $0 < p < 1$, the $p$th quantile $\xi_p$ is the solution $\theta = \xi_p$ of $R(\theta) = (2p - 1)(\frac{n}{2})$. It is easy to verify that if $p = (j - 1)/n$, then $\xi_p$ is any value within $[x_{(j-1)}, x_{(j)}]$, while if $(j - 1)/n < p < j/n$, then $\xi_p = x_{(j)}$.
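The set-valued rank and its inversion can be checked numerically with a short sketch. The function names are hypothetical; both functions return an interval (lower, upper) because the rank at a data point, and the quantile at $p = (j-1)/n$, are not unique. Distinct data values are assumed in the quantile function.

```python
def rank_interval(theta, xs):
    """Return (lo, hi), the set of values R(theta) = S'(theta) / 2 may take.

    Each x_i below theta contributes +1/2, each x_i above contributes -1/2,
    and each x_i equal to theta contributes any value in [-1/2, +1/2].
    """
    n = len(xs)
    n_below = sum(1 for x in xs if x < theta)
    n_tied = sum(1 for x in xs if x == theta)
    lo = n_below - n / 2
    return lo, lo + n_tied


def quantile_interval(p, xs):
    """Return (lo, hi), the solutions theta of R(theta) = (2p - 1) n / 2."""
    s = sorted(xs)
    n = len(s)
    k = round(p * n)
    if 0 < k < n and abs(p * n - k) < 1e-12:
        return s[k - 1], s[k]       # p = (j - 1)/n with j = k + 1
    j = int(p * n) + 1              # (j - 1)/n < p < j/n
    return s[j - 1], s[j - 1]


data = [5, 1, 3, 2]
print(rank_interval(2.5, data), rank_interval(3, data))
print(quantile_interval(0.5, data), quantile_interval(0.6, data))
```

For these four points the median ($p = 1/2$) is any value between the two middle order statistics, in line with the usual convention for even $n$.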

The preceding definitions of univariate rank and quantile will now be 
extended to the bivariate case, using the Oja dispersion function for a 
bivariate median. Note now that later it is convenient to define bivariate 
quantiles in a way which is notationally different from the univariate case. 


3 Oja bivariate ranks 


Now suppose that $x_1, \ldots, x_n$ all $\in \mathbb{R}^2$. The Oja bivariate median $m$ is the choice $\theta = m$ to minimize

$$S(\theta) = \sum_{i<j} A(x_i, x_j, \theta)$$

where $A(a, b, c)$ is the area of the triangle having vertices $a$, $b$ and $c$. The corresponding gradient function is

$$\nabla S(\theta) = \frac{1}{2} \sum_{i<j} u(x_i, x_j; \theta)$$

where $u$ is a "repulsion vector", having magnitude $|x_i - x_j|$ and direction perpendicular to and away from the chord between $x_i$ and $x_j$, towards $\theta$. See Brown and Hettmansperger (1987a) for details. Correspondingly, the Oja rank of $\theta$ is defined as

$$R(\theta) = \nabla S(\theta) = \frac{1}{2} \sum_{i<j} u(x_i, x_j; \theta).$$


Note that ranks $R(\theta)$ are bivariate vectors, with direction as well as magnitude. The orientation of $R(\theta)$ among other $\{R(x_i)\}$ will be roughly similar to that of $\theta$ among $\{x_i\}$, but in a general sense the ranks display more regularity than the original data, resembling the situation for univariate ranks.
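The Oja rank of a point can be computed directly from its definition as half the sum of repulsion vectors. The sketch below is illustrative only: the function names are hypothetical, and for a point lying exactly on a line through two observations the corresponding term is set to zero, which selects the midpoint of the admissible rank values.

```python
from itertools import combinations


def oja_objective(theta, points):
    """S(theta): sum of areas of the triangles (x_i, x_j, theta), i < j."""
    total = 0.0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        cross = (x2 - x1) * (theta[1] - y1) - (y2 - y1) * (theta[0] - x1)
        total += 0.5 * abs(cross)
    return total


def oja_rank(theta, points):
    """R(theta) = (1/2) * sum over i < j of the repulsion vectors u."""
    rx = ry = 0.0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        dx, dy = x2 - x1, y2 - y1
        # cross is twice the signed area; its sign tells on which side of
        # the line through x_i and x_j the point theta lies
        cross = dx * (theta[1] - y1) - dy * (theta[0] - x1)
        s = (cross > 0) - (cross < 0)
        # (-dy, dx) is perpendicular to the chord with length |x_i - x_j|;
        # multiplying by s makes it point towards theta
        rx += 0.5 * s * (-dy)
        ry += 0.5 * s * dx
    return rx, ry


pts = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
print(oja_rank((1.0, 2.0), pts))
```

Because $S$ is piecewise linear, a finite-difference gradient of `oja_objective` agrees with `oja_rank` at any point that does not lie on a line through two observations.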

An important observation is that $R(\theta)$ remains constant as $\theta$ changes locally. Furthermore, $R$ changes value only when $\theta$ crosses a line connecting some $x_i$, $x_j$. Then the increment to $R(\theta)$ is $\pm u(x_i, x_j; \theta)$. This observation establishes the fundamental relationship between the direct diagram, the plot of data $\{x_i\}$, and the Oja rank plot, of the rank values. It can be described as follows.
Figure 1: Direct diagram (upper case): Black points are original four data points; white points are secondary points. Tiles are numbered. Rank plot (lower case): Numbered vertices are rank plots of corresponding tiles.

Drawing in extended lines connecting all observation pairs $x_i$, $x_j$ defines a natural tiling in the direct diagram. The tiles are polygons with vertices at observations, or at intersections of a line connecting $x_i$, $x_j$ with a line connecting $x_k$, $x_l$, for some $i$, $j$, $k$, $l$. These intersections are called secondary points. For $n$ original data points in the general position (no parallel lines), there are $3\binom{n}{4} = O(n^4/8)$ secondary points. For the points not in the general position, the number of secondary points (and consequently the number of tiles) is smaller.
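The count of secondary points follows from choosing four data points and pairing them into two disjoint lines in one of three ways. A minimal sketch, with a hypothetical function name:

```python
from math import comb


def n_secondary_points(n):
    """Number of secondary points for n data points in general position.

    Each set of four points can be split into two pairs in three ways,
    and each split yields one intersection of the two connecting lines.
    """
    return 3 * comb(n, 4)


print(n_secondary_points(4), n_secondary_points(10))
```

For $n = 4$ this gives 3 secondary points and for $n = 10$ it gives 630, so the direct diagram becomes crowded quickly as $n$ grows.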

The ranks for all $\theta$ within a tile of the direct diagram are constant vectors. That constant is a point of the Oja rank plot, so rank plot points correspond to tiles in the direct diagram. Neighbouring tiles have rank values differing by a repulsion vector $u$. Therefore the rank value of any point of a boundary between tiles can be any point in the rank plot on the chord between the rank value points of the two tiles.

Furthermore, in the direct diagram, $n$ observation points and $3\binom{n}{4}$ secondary points lie at the junction of several tiles. Correspondingly, their rank values are not unique, but can be any value in the polygonal region of the rank plot whose vertices are the ranks of the abutting tiles. For illustration, see Section 4 and in particular Figure 1.

These observations lead to a number of further relationships between 
direct diagram and rank plot, described in the next section. 


4 Direct diagram and rank plots 


4.1 Relations between the tiles 


The regions of the rank plot are more regular than the tiles in the direct diagram. Figure 1 shows a case of $n = 4$ data points with just 3 secondary points. A small number of points, minimizing the clutter in the figure, was used to illustrate the duality relationship in a simple case. See Figure 3 for the rank plot of 10 points.

In general, every data point has $n - 1$ lines emanating from it, towards other data points, so in the rank plot the rank region for a data point is a $2(n - 1)$-sided polygon whose opposite sides are parallel and of equal length, i.e. an order-$(n - 1)$ parallelogram. By contrast, the rank regions of secondary points are conventional 4-sided parallelograms. Together, the two types of parallelogram form an orderly tiling of $\mathbb{R}^2$ in the rank plot. All sides of regions are repulsion vectors as occurring in the definition of Oja rank. Figure 2 provides illustrations.
Figure 2: Direct diagrams and rank plots near a data point (lower case) and near a secondary data point (upper case).
There is complete projective duality between the two plots; that is, points in the direct diagram are associated with regions in the rank plot, chords in the direct diagram are associated with chords in the rank plot, and regions in the direct diagram are associated with points in the rank plot:

Direct plot     | Rank plot
point           | tile: order-(n - 1) parallelogram
secondary point | tile: order-2 parallelogram
chord           | chord
extended line   | set of chords
tile            | point


Only the location information is lost in the rank plot: The original data can be recaptured from the rank plot and the value of any data point (or Oja median). Let $B$ be the sample covariance matrix computed on the Oja ranks. The standardized rank plot is then obtained if the rank plot items are multiplied (from the left) by $B^{-1/2}$. Both location and scale information is lost in the standardized rank plot.

Note that the Oja rank vectors are location invariant and affine equivariant in the sense that if the original observation vectors are multiplied by a full rank matrix $A$, the rank plot items will be multiplied by $A^* = \mathrm{abs}(\det(A))(A^{-1})^T$. If $A$ is orthogonal then $A^* = A$ and if $A = \mathrm{diag}(a_1, a_2)$ then $A^* = \mathrm{diag}(a_2, a_1)$. For elliptical distributions, the eigenvectors from the Oja rank covariance matrix are then the eigenvectors of the conventional covariance matrix but the eigenvalues are reversed. The fact is connected to rank plot scale elongation occurring in orthogonal directions to scale elongations in the direct diagram; see Section 4.3.
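The transformation $A^*$ is easy to evaluate in the bivariate case, where $\mathrm{abs}(\det(A))(A^{-1})^T$ is the cofactor matrix of $A$ times the sign of the determinant. The sketch below uses a hypothetical function name and plain nested lists; the diagonal example assumes positive entries.

```python
def rank_transform(a):
    """Return A* = |det A| (A^{-1})^T for a 2 x 2 matrix given as nested lists."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("A must have full rank")
    s = 1.0 if det > 0 else -1.0
    # (A^{-1})^T = (1/det) [[a22, -a21], [-a12, a11]], so the factor |det|
    # leaves only the sign of the determinant
    return [[s * a22, -s * a21], [-s * a12, s * a11]]


print(rank_transform([[2.0, 0.0], [0.0, 3.0]]))   # diagonal entries swap
print(rank_transform([[0.0, -1.0], [1.0, 0.0]]))  # a rotation is unchanged
```

The first example shows the swap of scale factors that underlies the reversal of eigenvalues mentioned above.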


4.2 The rank plot boundary 


In using a rank value to assess the position of a point among points in a 
data cloud, it is useful to know the extremities of the rank plot. The rank 
plots are not standardized and the rank plot boundaries are determined by 
the data. The boundary tiles in the direct diagram each have an open face 
extending to $\infty$, and the rank values of these tiles form the vertices of the
convex hull of the rank plot. Plotting these vertices will delineate the rank 
plot boundary, but there is a quicker more informal method of describing 
approximately where this boundary is. 

This method, yielding an approximate boundary of the rank plot, is as 
follows. 

Consider $\theta$ far away from the original data cloud, in direction $\alpha$, i.e. approximately

$$\theta = \hat{\theta} + r(\cos\alpha, \sin\alpha)'$$

with $r \to \infty$, where $\hat{\theta}$ is the Oja median. For $r$ large, $R(\theta)$ does not depend on $r$. Then the contribution to $R(\theta)$ in direction $\alpha$ of the vector $(1/2)u(x_i, x_j; \theta)$ is approximately of magnitude

$$\frac{1}{2} r_{ij} |\sin(\alpha - \alpha_{ij})|, \qquad (1)$$

where $r_{ij} = |x_i - x_j|$ and the line joining $x_i$, $x_j$ has direction $\alpha_{ij}$ (or $\alpha_{ij} + \pi$). The sum of terms like (1) is awkward because of the absolute value, but a smooth approximation comes from using

$$|\sin x| \approx a + \frac{1}{2}(1 - a)(1 - \cos 2x), \quad -\pi < x < \pi. \qquad (2)$$


Any value of $a$ with $0 < a < 1$ may be used; the error of approximation varies between $a$ at $x = 0$ and $(-1/4)(1 - 2a)^2/(1 - a)$ when $|\sin x| = (1/2)(1 - a)^{-1}$. The minimax error is $(2 - 2^{1/2})/4 = 0.1464$ at $a = (2 - 2^{1/2})/4 = 0.1464$ and this is a convenient choice for $a$. Then summing over terms in (1), using (2), gives

$$R(\theta) \approx \frac{1}{4}(1 + a) \sum_{i<j} r_{ij} - \frac{1}{4}(1 - a)\gamma \cos(2\alpha - 2\omega)$$
$$\phantom{R(\theta)} = \frac{1}{4}(1 + a)R - \frac{1}{4}(1 - a)\gamma[\cos(2\alpha)\cos(2\omega) + \sin(2\alpha)\sin(2\omega)],$$

where

$$R = \sum_{i<j} r_{ij},$$
$$\gamma^2 = \Big(\sum_{i<j} r_{ij} \cos(2\alpha_{ij})\Big)^2 + \Big(\sum_{i<j} r_{ij} \sin(2\alpha_{ij})\Big)^2,$$
$$\cos(2\omega) = \gamma^{-1} \sum_{i<j} r_{ij} \cos(2\alpha_{ij})$$

and

$$\sin(2\omega) = \gamma^{-1} \sum_{i<j} r_{ij} \sin(2\alpha_{ij}).$$

The parameters $\cos(2\omega)$, $\sin(2\omega)$, $\omega$, $\gamma$ and $R$ are easy to calculate and, along with $a$, describe $R(\theta)$ as a simple cosine function of $\alpha$, with minimum at $\alpha = \omega$ (and $\omega \pm \pi$) and maximum at $\alpha = \omega \pm \pi/2$. The corresponding shape of the approximate rank plot boundary is approximately an ellipse whose major and minor axes give a rough indication of the principal components of the cloud of Oja ranks. See Figure 3 for the rank plot of a data set of 10 observations.
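The choice of $a$ can be checked numerically. Writing $s = |\sin x|$, the approximation (2) equals $a + (1 - a)s^2$, so its error is a quadratic in $s$ on $[0, 1]$. The sketch below, with hypothetical function names and an arbitrary grid size, evaluates the largest absolute error for a few values of $a$.

```python
import math


def approx_error(a, s):
    """Error of a + (1 - a) s^2 as an approximation to s = |sin x|."""
    return a + (1.0 - a) * s * s - s


def max_abs_error(a, grid=10001):
    """Largest absolute error over an evenly spaced grid of s in [0, 1]."""
    return max(abs(approx_error(a, k / (grid - 1))) for k in range(grid))


a_star = (2 - math.sqrt(2)) / 4
for a in (0.10, a_star, 0.20):
    print(round(a, 4), round(max_abs_error(a), 4))
```

At $a = (2 - 2^{1/2})/4$ the positive error at $s = 0$ and the negative error at $s = (1/2)(1 - a)^{-1}$ have the same magnitude, which is what makes this value the minimax choice; smaller or larger values of $a$ give a larger maximum error.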


Figure 3: Rank plot for a data set of ten bivariate observations.


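Note that, for $\theta$ far away from the data cloud in the direction $(1, 0)'$,

$$R(\theta) = \frac{1}{2} \begin{pmatrix} \sum_{i<j} |x_{j2} - x_{i2}| \\ -\sum_{i<j} \mathrm{sgn}(x_{j2} - x_{i2})(x_{j1} - x_{i1}) \end{pmatrix}.$$

If $x_1, \ldots, x_n$ are i.i.d. from a spherical bivariate distribution with marginal Gini mean differences $\Gamma = E(|x_{11} - x_{21}|) = E(|x_{12} - x_{22}|)$ then clearly

$$\binom{n}{2}^{-1} R(\theta) \to \frac{\Gamma}{2} \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$

and, for $\theta$ far away in direction $\alpha$,

$$\binom{n}{2}^{-1} R(\theta) \to \frac{\Gamma}{2} \begin{pmatrix} \cos\alpha \\ \sin\alpha \end{pmatrix}.$$

The approximative boundary then is the sphere

$$\Big\{ \frac{\Gamma}{2} u : u'u = 1 \Big\}.$$

For observations from an elliptical distribution

$$PCx_1, \ldots, PCx_n,$$

where $P$ is orthogonal and $C = \mathrm{diag}(c_1, c_2)$ diagonal, the asymptotic boundary is then the ellipse

$$\Big\{ \frac{\Gamma}{2} P C^* u : u'u = 1 \Big\},$$

where $C^* = \mathrm{diag}(c_2, c_1)$. Major and minor axes give the principal components for the original bivariate distribution.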


4.3 Slopes of rank regions 


In the rank plot, the rank region of a data point is the order-$(n - 1)$ parallelogram of rank values attributable to that point, whose vertices are the rank values of all the tiles in the direct diagram which abut at the point. There is considerable information in the shape of a rank region as to the position of a data point within a data cloud: The lengths and directions of the $n - 1$ chords surrounding the rank region give the distances and directions of the other $n - 1$ points in the direct plot. (The direction is perpendicular ($\pm\pi/2$) to the direction of the chord.) Figure 4 illustrates how the rank region of an outlier will be elongated in a direction perpendicular to the direction of the rest of the data.

If a rank region has sides predominantly of one direction, the rest of the 
data is mostly oriented in a perpendicular direction from the data point 
in question. The lengths of the sides of a rank region are proportional to 
distances to other data points. Thus an outlier is distinguished by having 
a rank region with long sides, all with a similar direction. 

Other remarks can be made. 

(i) If all rank regions tend to have sides of similar direction, then the 
whole data cloud is elongated in a perpendicular direction. 

(ii) Data points towards the center of a data cloud tend to have rank 
regions whose sides are of mixed lengths. The directions will reflect the 
general orientation of the data cloud. 

(iii) Other data cloud patterns will have corresponding rank plot fea- 
tures. For instance, two separated mini-clouds yield a rank plot whose rank 
regions have sides tending to be distinctly short, in assorted directions, or 
distinctly long, in a definite direction perpendicular to the direction be- 
tween the mini clouds. 
Figure 4: Direct diagrams and data plots for six observations: In the upper case observation d is moved to be an outlier.




References 


[1] Brown, B. M. and Hettmansperger, T. P. (1987a). Affine invariant rank 
methods in the bivariate location model. J. R. Statist. Soc. B 49, 301- 
310. 

[2] Brown, B. M. and Hettmansperger, T. P. (1987b). Invariant tests in bivariate models and the $L_1$ criterion. In Statistical data analysis based on the $L_1$-norm and related methods, Ed. Y. Dodge, pp. 333-344. Amsterdam: North-Holland.

[3] Hettmansperger, T. P., Möttönen, J. and Oja, H. (1997a). Affine in- 
variant multivariate one-sample signed-rank tests. J. Am. Statist. As- 
soc. To appear. 

[4] Hettmansperger, T. P., Möttönen, J. and Oja, H. (1997b). Affine in- 
variant multivariate rank tests for several samples. Statistica Sinica. 
Conditionally accepted. 

[5] Hettmansperger, T. P., Nyblom, J. and Oja, H. (1992). On multivariate notions of sign and rank. In $L_1$-Statistical Analysis and Related Methods, Ed. Y. Dodge, pp. 267-278. Amsterdam: North-Holland.

[6] Liu, R.Y. (1990). On a notion of data depth based upon random sim- 
plices. Ann. Statist. 18, 405-414. 

[7] Liu, R.Y. (1992). Data depth and multivariate rank tests. In $L_1$-Statistical Analysis and Related Methods, Ed. Y. Dodge, pp. 279-302. Amsterdam: North-Holland.

[8] Liu, R.Y. and Singh, K. (1993). A quality index based on data depth 
and multivariate rank tests. J. Am. Statist. Ass. 88, 252-260. 

[9] Möttönen, J. and Oja, H. (1995). Multivariate spatial sign and rank 
methods. J. Nonparam. Statist. 5, 201-213. 

[10] Niinimaa, A. and Oja, H. (1997). Multivariate median. In Encyclopedia 
of Statistical Sciences. To appear. 

[11] Oja, H. (1983). Descriptive statistics for multivariate distributions. 
Statist. Probab. Lett. 1, 327-332. 

[12] Small, G. (1990). A survey of multidimensional medians. Inter. Statist. 
Rev. 58, 263-277. 


L1-Statistical Procedures and Related Topics
IMS Lecture Notes - Monograph Series (1997) Volume 31


Target estimation and implications to 
robustness 


Luisa Turrin Fernholz 


Temple University, Princeton, USA 


Abstract: This paper considers target functionals $\tilde{T}$, which are
bias-reduced functionals that can be obtained from a functional $T$ in a
parametric setting. It is shown that the $L_1$-error of the corresponding
target estimator decreases, and asymptotic normality is obtained using von
Mises expansions with the Hadamard derivative. It is also shown that
targeting can improve robustness, since the gross-error sensitivity
decreases under certain conditions. Applications to M-estimates of
location, the sample median, and simultaneous M-estimates of location and
scale are given.

1 Introduction


Target estimation was introduced by Cabrera and Fernholz (1996) as a 
procedure to reduce the bias and the variance of an estimator. In that 
paper, the von Mises expansion of a statistical functional was used to obtain 
conditions for this reduction in bias and variance. Moreover, it was also 
shown that the bias of the target estimator can be expressed in terms of 
the influence function of the original statistic. It is natural to ask what the 
relationship is between the von Mises expansion of the original functional 
and that of the target functional. This issue is addressed in the present 
paper where the asymptotic distribution of the target functional will follow 
from the corresponding von Mises expansion. The influence function of the
target functional is also obtained and a condition is given for reducing the
gross-error sensitivity.

The paper is organized as follows: In Section 2 we give a review of the
basic ideas behind target functionals and we consider the $L_1$ error and
the mean square error of these functionals. In Section 3 we look at the
asymptotics and robustness of target functionals through the von Mises
expansion and the influence function. Applications to M-estimates of
location, the sample median, and simultaneous M-estimates of location and
scale are given in Section 4.

Throughout this paper we shall assume that $T$ is a statistical functional
and the statistic $T(F_n)$ estimates the parameter $T(F_\theta)$, where
$F_n$ is the empirical d.f. corresponding to the sample $X_1,\dots,X_n$ of
i.i.d. random variables with common d.f. $F_\theta$, with
$\theta \in \Theta$, for $\Theta$ an open subset of the real numbers. When
a statistical functional $T$ satisfies $T(F_\theta) = \theta$, the
functional is said to be Fisher consistent.

We shall also assume that the expectation of $T(F_n)$,
$g_n(\theta) = E_\theta(T(F_n))$, exists for all $\theta \in \Theta$, where
$E_\theta$ indicates the expectation with respect to $F_\theta$. Moreover
the function $g_n$ will be assumed to be one-to-one and differentiable.


2 Target functionals


Definition 1 Let $g_n(\theta) = E_\theta(T(F_n))$ be a one-to-one function.
The functional $\tilde{T}_n$ induced by $T$ from the relation

$$g_n(\tilde{T}_n) = T \tag{1}$$

will be called the target functional of $T$. The statistic
$\tilde{T}_n(F_n)$ will be called the target estimator.


The above definition was introduced by Cabrera and Fernholz (1996), where
the goal was the reduction of the bias and the variance of an estimator.
Note that when $g_n(\theta) = a\theta + b$ for $a \neq 0$, the
corresponding target functional is $\tilde{T}_n = (T - b)/a$, which is
always unbiased, and its variance is reduced if and only if $a^2 > 1$. The
variance of $\tilde{T}_n$ remains unchanged when $|a| = 1$.
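
The linear case lends itself to a quick numerical check. The sketch below is purely illustrative: the statistic $T = a\bar{X} + b$, the constants, and the helper names `target_linear` and `simulate` are assumptions made for the example and are not taken from the paper.

```python
import random
import statistics


def target_linear(t, a, b):
    """Target estimator when g_n(theta) = a*theta + b: solve a*x + b = t."""
    if a == 0:
        raise ValueError("a must be nonzero")
    return (t - b) / a


def simulate(theta=1.0, a=2.0, b=0.5, n=20, reps=2000, seed=0):
    """Draw `reps` samples and return values of T and of the targeted T."""
    rng = random.Random(seed)
    t_vals, target_vals = [], []
    for _ in range(reps):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        # Hypothetical statistic with E(T) = a*theta + b.
        t = a * statistics.fmean(sample) + b
        t_vals.append(t)
        target_vals.append(target_linear(t, a, b))
    return t_vals, target_vals


t_vals, target_vals = simulate()
var_t = statistics.pvariance(t_vals)
var_target = statistics.pvariance(target_vals)
print(var_t, var_target)
```

Because the target map is affine, the empirical variance of the targeted values equals that of $T$ divided by $a^2$, which is the content of the "if and only if $a^2 > 1$" remark.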

It should be noted that Rousseeuw and Ronchetti (1981) used the function
$g_n$ to generate functionals $g_n^{-1}(T)$ in an entirely different
context, to study the influence curve of statistics used in testing
hypotheses.

The following two theorems refer to the reduction in the bias
$B_{\tilde{T}}(\theta)$ and mean square error of target functionals.

Theorem 1 If for a statistical functional $T$ the function
$g_n(\theta) = E_\theta(T(F_n))$ satisfies


(i) $\theta < g_n(\theta)$
(ii) $1 < g_n'(\theta) < b$
(iii) $0 < g_n''(\theta)$


then
a) $E_\theta(\tilde{T}_n) < \theta < E_\theta(T)$ and
b) $|B_{\tilde{T}}(\theta)| < |B_T(\theta)|$.


Theorem 2 If $T$ is a statistical functional with variance $V_T$ and
$g_n(\theta) = E_\theta(T(F_n))$ is differentiable with
$|g_n'(\theta)| > 1$ for all $\theta \in \Theta$, then the mean square
error of $\tilde{T}_n$ satisfies

$$MSE_{\tilde{T}_n} < V_T.$$


The proofs of the above theorems can be found in Cabrera and Fernholz
(1996). The other theorems in the current paper are new.
The following result refers to the $L_1$-error of the target estimator
$\tilde{T}_n$.


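Theorem 3 If $T$ is a statistical functional and $g_n(\theta)$ is
differentiable with $|g_n'(\theta)| > 1$ for all $\theta \in \Theta$, then

$$E_\theta|\tilde{T}_n - \theta| < E_\theta|T - E_\theta(T)|.$$

Proof: Using the mean value theorem for $g_n$ we have

$$g_n(\tilde{T}_n) = g_n(\theta) + (\tilde{T}_n - \theta)\,g_n'(\xi)$$

for some $\xi$ between $\theta$ and $\tilde{T}_n(F_n)$. Since by definition
$g_n(\tilde{T}_n) = T$, we have

$$\tilde{T}_n - \theta = \frac{T - g_n(\theta)}{g_n'(\xi)}.$$

By taking absolute values and expectations in this last equation, we
obtain

$$E|\tilde{T}_n - \theta| = E\big[\,|1/g_n'(\xi)|\,|T - g_n(\theta)|\,\big]
< E|T - E(T)|,$$

and the theorem is proved since $|g_n'(\theta)| > 1$. □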


An immediate consequence of this theorem is the following corollary that
relates the $L_1$-errors of $\tilde{T}_n$ and $T$ with the median,
$M_\theta(T)$, of the distribution of $T$.

Corollary 4 Under the conditions of Theorem 3 we have

$$E|\tilde{T}_n - \theta| < E|T - \theta| + |M_\theta(T) - E_\theta(T)|.$$

Proof: It follows directly from Theorem 3 and the triangle inequality,
since

$$E|\tilde{T}_n - \theta| < E|T - M_\theta(T)| + |M_\theta(T) - E_\theta(T)|
\le E|T - \theta| + |M_\theta(T) - E_\theta(T)|$$

by the $L_1$-minimizing property of the median. □


This corollary shows that when $T$ has a symmetric sampling distribution,
the target functional $\tilde{T}_n$ has smaller $L_1$-error.
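
The mechanism behind Theorem 3 can be checked pointwise. The following sketch assumes a hypothetical expectation function $g(\theta) = 2\theta + 0.5\sin\theta$, chosen only because $1.5 \le g'(\theta) \le 2.5$, and inverts it by bisection. The function names and the choice of $g$ are illustrative assumptions, not the paper's.

```python
import math


def invert_increasing(g, y, lo, hi, tol=1e-12):
    """Solve g(x) = y by bisection, assuming g is increasing on [lo, hi]."""
    if not (g(lo) <= y <= g(hi)):
        raise ValueError("y is outside [g(lo), g(hi)]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def g(theta):
    """Hypothetical expectation function with 1.5 <= g'(theta) <= 2.5."""
    return 2.0 * theta + 0.5 * math.sin(theta)


theta = 1.0
offsets = [-3.0, -1.0, -0.1, 0.2, 1.5, 4.0]
checks = []
for d in offsets:
    t = g(theta) + d  # a hypothetical observed value of T
    target = invert_increasing(g, t, -10.0, 10.0)
    # Store |target - theta| next to |T - g(theta)| for comparison.
    checks.append((abs(target - theta), abs(t - g(theta))))
    print(f"T = {t:.4f}, targeted value = {target:.4f}")
```

For every offset the distance of the targeted value from $\theta$ is at most $|T - g(\theta)|/1.5$, which is the mean value argument of the proof applied to this particular $g$.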


3 Asymptotics and robustness 


The results presented in Section 2 refer to the statistic
$\tilde{T}_n(F_n)$. But for each target functional $\tilde{T}_n$ we can
consider the statistic $\tilde{T}_n(F_m)$, where $n$ and $m$ do not
necessarily coincide. For this purpose, let $F_\theta$ be a parametric
family and $T$ be a Fisher consistent statistical functional. For each $n$
and $g_n(\theta)$ as above, we obtain the sequence of target functionals
$\tilde{T}_n = g_n^{-1}(T)$, each one of them generating statistics
$\tilde{T}_n(F_m)$ for a sample $X_1,\dots,X_m$.
Let $T$ be a Fisher consistent functional with influence function $\phi_1$
(see Hampel, 1974). For a sample $X_1,\dots,X_m$ from $F_\theta$, the first
order von Mises expansion of $T(F_m)$ is

$$T(F_m) = \theta + \frac{1}{m}\sum_{i=1}^{m}\phi_1(X_i) + \mathrm{Rem}_m. \tag{2}$$

With Hadamard or Fréchet differentiability under certain regularity
conditions, we have $\mathrm{Rem}_m = o_P(m^{-1/2})$ and $T$ is
asymptotically normal with variance $\sigma^2 = E(\phi_1(X))^2$ (see
Fernholz, 1983).

The following result gives the von Mises expansion of $\tilde{T}_n$.


Lemma 5 Let $T$ be a statistical functional and for a fixed $n$ let
$\tilde{T}_n$ be the corresponding target functional. If $T$ is Hadamard
differentiable at $F_\theta$ with von Mises expansion as in (2), then
$\tilde{T}_n$ is also Hadamard differentiable at $F_\theta$ with von Mises
expansion

$$\tilde{T}_n(F_m) = g_n^{-1}(\theta)
+ \frac{1}{m}\sum_{i=1}^{m}\frac{1}{g_n'(\theta)}\,\phi_1(X_i)
+ \widetilde{\mathrm{Rem}}_m. \tag{3}$$

Proof: First note that
$\tilde{T}_n(F_\theta) = g_n^{-1}(T(F_\theta)) = g_n^{-1}(\theta)$. Since
$T$ is Hadamard differentiable, the composition
$g_n^{-1}(T) = \tilde{T}_n$ will be Hadamard differentiable for a
differentiable real function $g_n$ by the chain rule (see Fernholz, 1983).
Therefore the influence function of $\tilde{T}_n$ is
$(1/g_n'(\theta))\,\phi_1(x)$ and the lemma is proved. □


In general we would be interested in using target estimates when $T$ is
such that $g_n(\theta) \neq \theta$. This clearly implies that when $T$ is
Fisher consistent then $\tilde{T}_n$ will not be. However, $\tilde{T}_n$
has less bias than $T$ when $g_n$ satisfies certain conditions.

When $n = m$, the expansion in (3) provides a linear approximation of
the statistic $\tilde{T}_n(F_n)$ with the influence function given by
$(1/g_n'(\theta))\,\phi_1(x)$. If we now compare the influence functions of
$\tilde{T}_n$ and $T$ we can conclude immediately that, when
$|g_n'(\theta)| > 1$, $\tilde{T}_n$ is more robust than $T$ in terms of
gross-error sensitivity (see Hampel et al., 1986) since we have

$$\sup_x |(1/g_n'(\theta))\,\phi_1(x)| < \sup_x |\phi_1(x)|.$$

When the function $g(\theta) = a\theta + b$ is linear,
$\tilde{T} = (T - b)/a$ and the corresponding influence function satisfies
$\tilde{\phi}_1(x) = \phi_1(x)/a$. The gross-error sensitivity of
$\tilde{T}$ will be smaller than that of $T$ when $|a| > 1$.

The expansion in (3) above is useful to derive the asymptotic normality
and efficiency of $\tilde{T}_n$, as we see in


Theorem 6 Let $T$ and $\tilde{T}_n$ be as in Lemma 5. If $T$ is Hadamard
differentiable at $F = F_\theta$ with von Mises expansion as in (2) then,
for fixed $n$ and $m \to \infty$, we have:

a) $\tilde{T}_n$ is asymptotically normal with

$$\sqrt{m}\,\big(\tilde{T}_n(F_m) - \tilde{T}_n(F)\big)
\xrightarrow{D} N(0, \tilde{\sigma}^2),$$

where the asymptotic variance of $\tilde{T}_n$ is
$\tilde{\sigma}^2 = (1/g_n'(\theta))^2\sigma^2$;
b) If $|g_n'(\theta)| > 1$, $\tilde{T}_n$ is asymptotically more efficient
than $T$.


Proof: Part a) is an immediate consequence of Lemma 5 since the Hadamard
differentiability of $\tilde{T}_n$ implies the asymptotic normality of
$\tilde{T}_n$ (see Fernholz, 1983).
For part b), note that when $|g_n'(\theta)| > 1$, the asymptotic variance
of $\tilde{T}_n$ satisfies

$$\tilde{\sigma}^2 = (1/g_n'(\theta))^2\sigma^2 < \sigma^2. \qquad \Box$$


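Theorem 6 refers to the asymptotic normality of $\tilde{T}_n(F_m)$ for
fixed $n$ when $m$ tends to infinity. When $n = m$ is large, we have

Theorem 7 Let $T$ and $\tilde{T}_n$ be as in Theorem 6. If $m = n$ tends
to infinity and for all $\theta$,
$\sqrt{n}\,(g_n(\theta) - \theta) \to 0$ and
$\sqrt{n}\,(1/g_n'(\theta) - 1) \to 0$, then for all $\theta$ it holds that

$$\sqrt{n}\,\big(\tilde{T}_n(F_n) - \theta\big) \xrightarrow{D} N(0, \sigma^2).$$


Proof: From the expansions (2) and (3) above we obtain

$$\begin{aligned}
\sqrt{n}\,\big(\tilde{T}_n(F_n) - \theta\big)
&= \sqrt{n}\,\big(\tilde{T}_n(F_n) - T(F_n) + T(F_n) - \theta\big) \\
&= \sqrt{n}\,\big(\tilde{T}_n(F_n) - T(F_n)\big)
 + \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi_1(X_i)
 + \sqrt{n}\,\mathrm{Rem}_n.
\end{aligned} \tag{4}$$

Since $T$ is Hadamard differentiable, we have

$$\sqrt{n}\,\mathrm{Rem}_n \xrightarrow{P} 0$$

and

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\phi_1(X_i) \xrightarrow{D} N(0, \sigma^2).$$

Hence, it suffices to show that the first term in (4) converges to zero in
probability. But in

$$\sqrt{n}\,\big(\tilde{T}_n(F_n) - T(F_n)\big)
= \sqrt{n}\,\big(g_n^{-1}(\theta) - \theta\big)
+ \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big(1/g_n'(\theta) - 1\big)\phi_1(X_i)
+ \sqrt{n}\,\big(\widetilde{\mathrm{Rem}}_n - \mathrm{Rem}_n\big) \tag{5}$$

the first term tends to zero by hypothesis and the third term converges to
zero by the Hadamard differentiability of $T$ and $\tilde{T}_n$. For the
second term in (5), Markov's inequality implies that for any
$\epsilon > 0$

$$P\Big\{\Big|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\big(1/g_n'(\theta) - 1\big)\phi_1(X_i)\Big| > \epsilon\Big\}
\le \frac{\sqrt{n}\,\big|1/g_n'(\theta) - 1\big|\,E|\phi_1(X)|}{\epsilon},$$

which tends to zero when $n$ tends to infinity since by hypothesis
$\sqrt{n}\,(1/g_n'(\theta) - 1) \to 0$, and the theorem is proved. □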


From Theorems 6 and 7 we can see that, when $n = m$ is large, there is
little gain for the target estimators in terms of gross-error sensitivity.
However, these theorems ensure that there will be no loss in gross-error
sensitivity when we use targeting to reduce bias.


4 Examples


4.1 M-estimates of location


Since in this section the sample size $n$ will always be fixed, we shall
simplify the notation by using $g(\theta)$ instead of $g_n(\theta)$ and
$\tilde{T}$ for the target functionals. For a parametric family
$F_\theta$, let the functional $T(F_\theta) = \theta$ be defined
implicitly as a solution of

$$\int_{-\infty}^{\infty}\psi(x - \theta)\,dF_\theta(x) = 0.$$

The corresponding statistic $T(F_n)$ is the well-known M-estimate of
location. When the parametric family satisfies
$F_\theta(x) = F(x - \theta)$ for all $\theta$ and some d.f. $F$, the
corresponding M-estimate is location equivariant, that is
$T(F_\theta) = \theta + T(F)$. It was shown in Cabrera and Fernholz (1996)
that the bias of an M-estimate of location is constant. So
$g(\theta) = \theta + B$, and the influence functions of $T$ and
$\tilde{T}$ coincide.


4.2 The sample median 


The functional $T(F_\theta) = F_\theta^{-1}(1/2) = \theta$ corresponds to
the sample median $T(F_n) = F_n^{-1}(1/2)$. When $F_\theta$ is not
symmetric about $\theta$ the sample median is biased. The second order von
Mises expansion of $T$ (see Fernholz, 1996; Fankhauser, 1996) gives


$g(\theta) = E_\theta(F_n^{-1}(1/2))$


(6 
- aay toll). 


Its derivative is 


(f/(0))* — FO) f" (0) 


SOSIE FO) 


+o(1/n), 

hence for $\theta$ such that $f''(\theta) < 0$, we have $g'(\theta) > 1$.
In this case $T$ satisfies the hypotheses of most of the theorems
presented above. The target estimator $g^{-1}(F_n^{-1}(1/2))$ will have
smaller bias, MSE, and gross-error sensitivity than $F_n^{-1}(1/2)$.


4.3 Simultaneous M-estimates of location and scale. 


Consider a family of d.f.'s $F_\theta$ with $\theta = (\mu, \sigma)$ and
the two-dimensional functional
$T(F_\theta) = (T_1(F_\theta), T_2(F_\theta))$ defined implicitly by

where $\psi = (\psi_1, \psi_2)$. The corresponding statistic
$T(F_n) = (T_1(F_n), T_2(F_n))$ satisfies a system of equations and is
called an M-estimate of location and scale. See Huber (1981) and Hampel
et al. (1986).

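When the family of distributions is such that
$F_\theta(x) = F\big(\frac{x - \mu}{\sigma}\big)$ for some fixed d.f. $F$,
the functional $T$ satisfies
$T(F_\theta) = (\mu + \sigma T_1(F), \sigma T_2(F))$ and is said to be
location-scale equivariant (see Hampel et al., 1986). In this case, it can
be shown that

$$g(\theta) = g(\mu, \sigma) = (\mu + \sigma C_1, \sigma C_2),$$

where $C_1$ and $C_2$ are constants independent of $\mu$ and $\sigma$. Now,
the target functional is $\tilde{T} = (\tilde{T}_1, \tilde{T}_2)$ with

$$\tilde{T}_1 = T_1 - (C_1/C_2)\,T_2, \qquad \tilde{T}_2 = (1/C_2)\,T_2,$$

and so the corresponding two-dimensional influence function is
$\tilde{\phi} = (\tilde{\phi}_1, \tilde{\phi}_2)$ where

$$\tilde{\phi}_1(x) = \phi_1(x) - (C_1/C_2)\,\phi_2(x), \qquad
\tilde{\phi}_2(x) = (1/C_2)\,\phi_2(x).$$

If we let $\|\cdot\|$ denote the Euclidean norm, then for each $x$

$$\begin{aligned}
\|\tilde{\phi}(x)\|^2
&= |\tilde{\phi}_1(x)|^2 + |\tilde{\phi}_2(x)|^2 \\
&= |\phi_1(x) - (C_1/C_2)\,\phi_2(x)|^2 + |(1/C_2)\,\phi_2(x)|^2 \\
&\le |\phi_1(x)|^2 + |(C_1/C_2)\,\phi_2(x)|^2 + |(1/C_2)\,\phi_2(x)|^2 \\
&= |\phi_1(x)|^2 + \big((C_1/C_2)^2 + (1/C_2)^2\big)\,|\phi_2(x)|^2 \\
&\le |\phi_1(x)|^2 + |\phi_2(x)|^2 \\
&= \|\phi(x)\|^2
\end{aligned}$$

when $C_1^2 + 1 < C_2^2$. Therefore the gross-error sensitivity of
$\tilde{T}$ will be smaller than that of $T$ when $C_1^2 + 1 < C_2^2$.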
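
For the location-scale case the target map is explicit, so it can be sketched in a few lines. The constants `c1` and `c2` and the function names below are hypothetical placeholders; in practice $C_1$ and $C_2$ depend on the estimator and on $F$.

```python
def g(mu, sigma, c1, c2):
    """Expectation map g(mu, sigma) = (mu + sigma*C1, sigma*C2)."""
    return (mu + sigma * c1, sigma * c2)


def target(t1, t2, c1, c2):
    """Invert g: return (T1 - (C1/C2)*T2, T2/C2)."""
    if c2 == 0:
        raise ValueError("C2 must be nonzero")
    return (t1 - (c1 / c2) * t2, t2 / c2)


def phi2_weight(c1, c2):
    """Coefficient (C1/C2)^2 + (1/C2)^2 multiplying |phi_2|^2 in the bound."""
    return (c1 * c1 + 1.0) / (c2 * c2)


c1, c2 = 0.3, 1.2        # illustrative values with C1^2 + 1 < C2^2
mu, sigma = 5.0, 2.0
t1, t2 = g(mu, sigma, c1, c2)
print(target(t1, t2, c1, c2))   # recovers (mu, sigma) up to rounding
print(phi2_weight(c1, c2))      # below 1 exactly when C1^2 + 1 < C2^2
```

The round trip through `g` and `target` shows that the target functional undoes the systematic distortion of both components, and `phi2_weight` is the factor that the condition $C_1^2 + 1 < C_2^2$ forces below one.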




5 Closing remarks. 


We should note that the regularity conditions of the theorems in Sections 
2 and 3 are not too stringent as shown by the examples and applications of 
target estimation presented in Section 4 as well as in Cabrera and Fernholz 
(1996). Moreover, the hypotheses of Theorem 7 are not as restrictive as
they may seem. The condition $g_n(\theta) - \theta = o(1/\sqrt{n})$ is
equivalent to $E_\theta(\mathrm{Rem}_n) = o(1/\sqrt{n})$, which is quite
plausible since $\mathrm{Rem}_n = o_P(1/\sqrt{n})$ for Hadamard
differentiable functionals. The condition referring to the derivative of
$g_n$ is also satisfied by most reasonable statistics. The examples
presented in Section 4 all satisfy these conditions.

Target estimation is a computer intensive procedure that has proved to be 
very effective in reducing the bias and the variance of estimators. Target 
estimation can also be performed when the function g,(0) is the median 
of the statistic, as was first presented in Cabrera and Watson (1997). The 
examples given in Cabrera and Fernholz (1996) and Cabrera and Watson 
(1997), as well as the applications to practical problems in computer vision 
of Cabrera and Meer (1996), reveal that target estimation is an effective 
method of bias and variance reduction in many situations. This paper 
shows the additional gain in robustness due to targeting. 

Although the problems of bias reduction and robustness of an estimator 
seem to be two independent issues, we showed in this paper that the con- 
ditions for bias and variance reduction will assure a smaller bound for 
the influence function of the bias-reduced functional, which means smaller 
gross-error sensitivity for the target estimator. The von Mises expansions 
proved to be a powerful tool to approach the theoretical issues of target 
estimation. Simulations and practical applications need to be performed
now to evaluate more precisely the gain in robustness when estimators are
targeted. This is a topic of ongoing research.


1 Introduction


In this paper we will develop a general methodology for nonparametrically
testing the goodness-of-fit of a parametric or a semiparametric model.
To begin with the simplest example, assume one observes independent
identically distributed (i.i.d.) random variables $X_1,\dots,X_n$ on the
real line, from some unknown distribution function (d.f.) $F$.
Furthermore, let $\mathcal{F} = \{F_\theta : \theta \in \Theta\}$ be a
given family of distribution functions parametrized by some vector
$\theta \in \Theta \subset \mathbb{R}^k$. To keep the discussion as simple
as possible, we will assume that no nuisance parameters are present so
that $F_\theta$ is uniquely determined by $\theta$. The problem of how to
test for the hypothesis

$$H_0 : F \in \mathcal{F}$$


has attracted many researchers over the past decades. Most of the test 
statistics are certain functionals of the underlying empirical process. More 


374 Winfried Stute 


precisely, denote with

$$F_n(x) = n^{-1}\sum_{i=1}^{n} 1_{\{X_i \le x\}}, \qquad x \in \mathbb{R},$$

the empirical distribution function of the data. The by now classical
invariance principle of Donsker (1952) then asserts that the empirical
process

$$\alpha_n(x) = n^{1/2}\,[F_n(x) - F(x)], \tag{1}$$

in the Skorokhod space $D[-\infty, \infty]$, converges in distribution to

$$\alpha_\infty = B^0 \circ F.$$


Here, $B^0$ is a Brownian Bridge on the unit interval, i.e., a centered
Gaussian process with covariance function

$$\mathrm{Cov}[B^0(s), B^0(t)] = \min(s, t) - st.$$


For details and extensions, see Gaenssler and Stute (1979) and Shorack
and Wellner (1986). To test for a simple hypothesis, $F = F_{\theta_0}$,
one needs to replace $F$ in (1) by $F_{\theta_0}$, so that under $H_0$

$$n^{1/2}\,[F_n - F_{\theta_0}]$$

equals $\alpha_n$. In particular, critical values, if not available for
finite sample size, may be obtained from the distribution of the limit
$\alpha_\infty$. For composite hypotheses, things unfortunately become
more complicated. Under $H_0$, $F = F_{\theta_0}$ for some unknown
$\theta_0 \in \Theta$, the true parameter. Since now $\theta_0$ remains
unspecified, it needs to be estimated from the data by some $\theta_n$,
say. We thus come up with the so-called empirical process with estimated
parameters

$$\hat{\alpha}_n = n^{1/2}\,[F_n - F_{\theta_n}].$$
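
As a concrete sketch of the quantities just introduced, the code below evaluates the supremum of $|\hat{\alpha}_n|$ for a hypothetical exponential model $F_\theta(x) = 1 - e^{-x/\theta}$, with $\theta_n$ taken to be the sample mean. The model, the estimator and the function names are assumptions chosen for illustration; the paper does not prescribe them.

```python
import math
import random


def sup_empirical_process(sample, cdf):
    """Return sup_x n^{1/2} |F_n(x) - cdf(x)| for a continuous cdf.

    The supremum is attained at a jump of F_n, so it suffices to compare
    cdf with the left and right limits of F_n at each order statistic.
    """
    n = len(sample)
    xs = sorted(sample)
    sup = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        sup = max(sup, i / n - fx, fx - (i - 1) / n)
    return math.sqrt(n) * sup


rng = random.Random(1)
data = [rng.expovariate(0.5) for _ in range(200)]  # exponential, mean 2.0
theta_n = sum(data) / len(data)                    # estimator of theta
stat = sup_empirical_process(data, lambda x: 1.0 - math.exp(-x / theta_n))
print(theta_n, stat)
```

Because the parameter is estimated, the classical Kolmogorov critical values do not apply to `stat`, which is exactly the difficulty with composite hypotheses described above.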


This process may be viewed as a basic device to measure the deviance
between a completely nonparametric and a parametric fit. It has been
extensively studied by Durbin (1973). To briefly recall its ingredients,
assume that $\theta_n$ has, under $H_0$, a linear expansion

$$n^{1/2}(\theta_n - \theta_0)
= n^{-1/2}\sum_{i=1}^{n} l(X_i, \theta_0) + o_P(1),$$

where $l$ is a proper vector-valued function with expectation zero and
finite covariance matrix. Then, under appropriate smoothness assumptions,

$$\hat{\alpha}_n(x) = \alpha_n(x)
- G^T(x, \theta_0)\int l(y, \theta_0)\,\alpha_n(dy) + o_P(1)$$

uniformly in $x$, where

$$G(x) = G(x, \theta_0)
= \frac{\partial F_\theta(x)}{\partial\theta}\Big|_{\theta = \theta_0}.$$

From this we readily get

$$\hat{\alpha}_n \to B^0 \circ F - G^T V \equiv \hat{\alpha}_\infty$$

with

$$V = \int l(y, \theta_0)\,B^0 \circ F(dy).$$

The limit $\hat{\alpha}_\infty$ is again a centered Gaussian process, but
its covariance function is more complicated, and tables for critical
values may and will depend on $\theta_0$ and are not readily available. In
such a situation a parametric bootstrap may offer a useful possibility to
approximate the distribution of $\hat{\alpha}_n$ under $H_0$; see Stute et
al. (1993).

Though from a computational point of view this seems to be quite
satisfactory, it is worthwhile considering also another approach which not
only provides an approximation in distribution, but also leads to a deeper
understanding of the involved processes. For $\hat{\alpha}_n$, this
approach has been initiated, in a landmark paper, by Khmaladze (1981). As
to this, recall that $B^0$ has the representation

$$B^0(t) = B(t) - tB(1), \qquad 0 \le t \le 1,$$

in terms of a Brownian Motion $B$ and, vice versa,

$$B^0(t) = B(t) - \int_0^t \frac{B^0(s)}{1 - s}\,ds. \tag{2}$$

In the latter equation $B$ may be viewed as the innovation martingale and
the integral as the compensator in the Doob-Meyer decomposition of $B^0$.
Now, Khmaladze (1981) was able to also find the corresponding
decomposition for $\hat{\alpha}_\infty$. Replacing $\hat{\alpha}_\infty$
by its innovation martingale then leads to a new process, say
$T\hat{\alpha}_\infty$, which is a Gaussian martingale and hence a
Brownian Motion w.r.t. proper time. In particular, this process is
distribution-free modulo a transformation in time and therefore is a good
candidate for giving rise to distribution-free test statistics.

It is the purpose of the present paper to extend Khmaladze's (1981)
approach to a much more general setting. This will enable us to design
model checks in the context of regression, time series, multivariate
analysis and survival analysis, among others. Now, rather than (2), our
starting point will be the following representation of $B^0$ in terms of
$B$, which incorporates a transformation in time and a scale factor:

$$B^0(t) = (1 - t)\,B\Big(\frac{t}{1 - t}\Big). \tag{3}$$

To show that the right-hand side has the same covariance structure as
$B^0$, just use the monotonicity of the time transformation and apply

$$\mathrm{Cov}[B(s), B(t)] = \min(s, t).$$
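
The covariance identity behind (3) can also be verified numerically. This short check uses only $\mathrm{Cov}[B(u), B(v)] = \min(u, v)$; the function names are illustrative.

```python
def cov_time_changed(s, t):
    """Covariance of (1-s)B(s/(1-s)) and (1-t)B(t/(1-t)) for 0 <= s, t < 1."""
    return (1.0 - s) * (1.0 - t) * min(s / (1.0 - s), t / (1.0 - t))


def cov_bridge(s, t):
    """Covariance of the Brownian bridge: min(s, t) - s*t."""
    return min(s, t) - s * t


grid = [k / 20 for k in range(20)]  # 0.0, 0.05, ..., 0.95
max_gap = max(
    abs(cov_time_changed(s, t) - cov_bridge(s, t)) for s in grid for t in grid
)
print(max_gap)
```

Since $u \mapsto u/(1-u)$ is increasing, the minimum of the transformed times is the transformed minimum, and for $s \le t$ the product collapses to $s(1 - t) = s - st$.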


Monotonicity will also be a crucial issue in the examples which will be
shortly discussed. In each case the limit process will be of the following
type:

$$R_\infty = G_1\,B \circ \psi - G_2^T V. \tag{4}$$

Here, $G_1$ and $G_2$ are two deterministic functions, $\psi$ denotes the
aforementioned nondecreasing nonnegative time transformation and $V$ is a
normal vector, which may and will depend on $B$. Conclude from the
introductory remarks that for $R_\infty = \hat{\alpha}_\infty$, i.e., for
the empirical process with estimated parameters,

$$G_1 = 1 - F, \qquad \psi = F/(1 - F)$$

and

$$G_2 = \frac{\partial F_\theta}{\partial\theta} \quad \text{at } \theta = \theta_0.$$

In our second example we discuss a situation which typically comes up when
the $X$-data represent lifetimes. Under random right censorship one
observes, due to other causes of failure, variables
$Z_i = \min(X_i, Y_i)$, $1 \le i \le n$, where the censoring variables
$Y_i$ are independent and also independent of the $X$'s, with the common
d.f. $G$. Also available are 0-1 variables
$\delta_i = 1_{\{X_i \le Y_i\}}$ indicating whether $X_i$ has been
observed or not. Since under censorship $F_n$ may not be available, it
needs to be replaced by the nonparametric MLE adapted to the new
framework:

$$1 - \hat{F}_n(x) = \prod_{i=1}^{n}
\Big[1 - \frac{\delta_{[i:n]}}{n - i + 1}\Big]^{1_{\{Z_{i:n} \le x\}}}. \tag{5}$$

This is the famous product-limit estimator due to Kaplan and Meier (1958).
In (5), $Z_{1:n} \le \dots \le Z_{n:n}$ are the order statistics of the
observed $Z$'s. Finally, $\delta_{[i:n]}$ denotes the $\delta$-variable
associated with $Z_{i:n}$. Note that $\hat{F}_n$ boils down to $F_n$ if
all $\delta$'s equal one. Breslow and Crowley (1974) extended Donsker's
invariance principle to the present setup. They showed that the so-called
Kaplan-Meier process

$$\beta_n(x) = n^{1/2}\,[\hat{F}_n(x) - F(x)]$$

converges in distribution to a centered Gaussian process $\beta_\infty$.
In our notation it admits the representation

$$\beta_\infty = (1 - F)\,B \circ C,$$

where, under a continuity assumption,

$$C(x) = \int_{-\infty}^{x}\frac{F(dy)}{(1 - F(y))^2\,(1 - G(y))}.$$

Hence, in terms of (4), the Kaplan-Meier process with estimated parameters
converges to $R_\infty$ with

$$G_1 = 1 - F \quad \text{and} \quad \psi = C.$$

The function $G_2$ is the same as before. A detailed analysis of this
example may be found in Nikabadze and Stute (1997).
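
The product-limit formula for the Kaplan-Meier estimator translates almost literally into code. The sketch below is a minimal illustration that assumes there are no ties among the observed $Z$'s; the function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(z, delta):
    """Product-limit estimate of F at each ordered observation.

    z     : observed values Z_i = min(X_i, Y_i), assumed to have no ties
    delta : 0-1 indicators, 1 if the lifetime X_i was observed
    Returns the ordered Z's and the estimated d.f. evaluated at them.
    """
    n = len(z)
    if len(delta) != n:
        raise ValueError("z and delta must have the same length")
    order = sorted(range(n), key=lambda i: z[i])
    survival = 1.0
    times, cdf = [], []
    for rank, i in enumerate(order, start=1):
        # Factor [1 - delta_[i:n] / (n - i + 1)] of the product-limit formula.
        survival *= 1.0 - delta[i] / (n - rank + 1)
        times.append(z[i])
        cdf.append(1.0 - survival)
    return times, cdf


times, cdf = kaplan_meier([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
print(times, cdf)
```

With every indicator equal to one the returned values are $1/n, 2/n, \dots, 1$, so the estimator reduces to the ordinary empirical d.f., as noted in the text.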

In our next example, we will discuss the important problem of model
checks in regression. For this, let $(X, Y)$ be a bivariate random vector
such that $E|Y| < \infty$. Denote with

$$m(x) = E\{Y \mid X = x\}$$

the regression function of $Y$ w.r.t. $X = x$. Also, let
$\mathcal{M} = \{m_\theta : \theta \in \Theta\}$ be a given parametric
family of candidates for $m$. For example, the $m_\theta$'s may consist of
all functions spanned by a given basis $g_1,\dots,g_k$:

$$m_\theta(x) = \theta_1 g_1(x) + \dots + \theta_k g_k(x).$$

This includes, e.g., all polynomials or trigonometric polynomials with a
given bound on the degree. To test for the hypothesis

$$H_0 : m \in \mathcal{M},$$

let $\theta_n$ be, under $H_0$, any estimator of $\theta_0$, computed from
a sample of independent replicates of $(X, Y)$, admitting a representation

$$n^{1/2}(\theta_n - \theta_0)
= n^{-1/2}\sum_{i=1}^{n} l(X_i, Y_i, \theta_0) + o_P(1).$$

The residuals

$$\hat{e}_{in} = Y_i - m_{\theta_n}(X_i), \qquad 1 \le i \le n,$$

traditionally play an important role in model diagnostics for regression.
In our approach they will be embedded into a marked point process

$$\hat{\gamma}_n(x) = n^{-1/2}\sum_{i=1}^{n}\hat{e}_{in}\,1_{\{X_i \le x\}},
\qquad x \in \mathbb{R}.$$

Under $H_0$, one can show that

$$\hat{\gamma}_n \to \hat{\gamma}_\infty = B \circ \psi - G_2^T V,$$

where

$$\psi(x) = \int_{-\infty}^{x}\sigma^2(u)\,F(du)$$

and

$$G_2(x) = \int_{-\infty}^{x}\frac{\partial m_\theta(u)}{\partial\theta}\,F(du)
\quad \text{at } \theta = \theta_0,$$

with $\sigma^2(u) = \mathrm{Var}\{Y \mid X = u\}$ denoting the conditional
variance and $F$ being the marginal distribution of $X$. See Stute (1997)
and Stute, Thies and Zhu (1996) for details. We thus see that (4) applies
again with $G_1 = 1$ and $G_2$, $\psi$ from above.
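
The marked empirical process of residuals is a cumulative sum of residuals ordered by the covariate. The sketch below assumes a straight-line model fitted by ordinary least squares and simulated data; the model, the helper names and the constants are illustrative assumptions only.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept and slope for the model m(x) = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def marked_residual_process(x, residuals):
    """Values of n^{-1/2} * sum_i e_i 1{X_i <= x} at the ordered X's."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    total = 0.0
    path = []
    for i in order:
        total += residuals[i]
        path.append((x[i], total / math.sqrt(n)))
    return path


rng = random.Random(2)
x = [rng.uniform(0.0, 1.0) for _ in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
path = marked_residual_process(x, residuals)
print(max(abs(value) for _, value in path))
```

With an intercept in the model the least-squares residuals sum to zero, so the path returns to zero at the largest $X_i$; this tied-down behaviour is one visible effect of parameter estimation on the limit process.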

Another example to which our methodology will apply is in a time series
context. For this, let $X_1, X_2, \dots$ be a stationary sequence of
observations. We are interested in the dynamics of the process. One
possibility would be to decompose a future observation $X_i$ into the part
explained by the past observations and the $i$-th innovation:

$$X_i = m(X_{i-1}, X_{i-2}, \dots) + \varepsilon_i.$$

Thus $m$ is the regression function of $X_i$ given
$\mathcal{F}_{i-1} = \sigma(X_{i-1}, X_{i-2}, \dots)$. If we are, e.g.,
interested in testing whether the $X$-sequence is first order
autoregressive with $m \in \mathcal{M}$, a pre-specified parametric model,
we could form, similar to the regression case, the process

$$\hat{\delta}_n(x) = n^{-1/2}\sum_{i=1}^{n}
[X_i - m_{\theta_n}(X_{i-1})]\,1_{\{X_{i-1} \le x\}}.$$

Due to dependencies some little extra work is needed to show that also
in this case $\hat{\delta}_n \to \hat{\delta}_\infty$, where
$\hat{\delta}_\infty$ is of type (4) with $G_1 = 1$ and some appropriate
$\psi$ and $G_2$. Note that the stationary distribution now also depends
on $\theta_0$. See Koul and Stute (1997) for details.
Our final example concerns a generalized linear model. Here one observes
a sequence of multivariate data $(X_i, Y_i)$, $1 \le i \le n$, from
$\mathbb{R}^{k+1}$, for which it is assumed that the regression function
of $Y_i$ given $X_i$ has a decomposition into a linear form of $X_i$ and a
specified link function $h$:

$$m(x) = E[Y_i \mid X_i = x] = h(\theta_{10}x_1 + \dots + \theta_{k0}x_k).$$

The associated process for testing that $m$ is of this form becomes

$$\hat{\varepsilon}_n(x) = n^{-1/2}\sum_{i=1}^{n}
[Y_i - h(\langle\theta_n, X_i\rangle)]\,
1_{\{\langle\theta_n, X_i\rangle \le x\}},$$

where $\theta_n$ is an estimator of
$\theta_0 = (\theta_{10},\dots,\theta_{k0})$ and
$\langle\cdot,\cdot\rangle$ is the scalar product in $\mathbb{R}^k$.

Again it can be shown that under standard regularity assumptions
$\hat{\varepsilon}_n$ in the limit is of the form (4). The case when $h$
is unspecified requires nonparametric estimation of the (univariate) link
function.

This list of examples indicates that the class of Gaussian processes
considered in (4) is rich enough to cover many interesting cases which
typically appear as limit processes when parameters need to be estimated.
Since their distributional character is not readily understood, we propose
to transform $R_\infty$ from (4) to another process, which has a much
nicer structure, namely a Brownian Motion in proper time. This will be the
content of the following section.


2 Transformation of Gaussian processes


As we have seen in the first section, Gaussian processes of type (4)

$$R_\infty = G_1\,B \circ \psi - G_2^T V$$

frequently appear as limits of certain marked empirical processes when
parameters need to be estimated. In this section we introduce a
transformation $T$ which maps $R_\infty$ into a Brownian Motion in proper
time. This transformation will be a composition of two linear operators
$T_1$ and $T_2$ which will be defined now.

Assume that $G_1$ is a function of bounded variation which is positive
on its support. For the sake of simplicity only a continuous $G_1$ will be
considered. Put

$$T_1 f(x) = f(x) - \int_{-\infty}^{x}\frac{f(y)}{G_1(y)}\,G_1(dy). \tag{6}$$

Here $f$ varies in the class of functions for which the integral is
defined.

Lemma 1 The stochastic process $T_1 G_1 B \circ \psi$ is a Brownian Motion
w.r.t. time

$$\varphi(x) = \int_{-\infty}^{x} G_1^2(y)\,\psi(dy).$$

Proof: We have

$$T_1 G_1 B \circ \psi(x)
= G_1 B \circ \psi(x) - \int_{-\infty}^{x} B \circ \psi\,dG_1
= \int_{-\infty}^{x} G_1\,dB \circ \psi.$$

It follows that $T_1 G_1 B \circ \psi$ is a centered Gaussian process with
covariance function $\min\{\varphi(x_1), \varphi(x_2)\}$ at $x_1, x_2$. □


For the empirical process and the Kaplan-Meier process the function $G_1$
equals $1 - F$ so that

$$T_1 f = f + \int\frac{f}{1 - F}\,dF,$$

which corresponds to (2). For the other examples, $G_1 = 1$ in which case
the integral in (6) vanishes and $T_1$ reduces to the identity operator.

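Since $T_1$ is a linear operator and since $V$ does not depend on $x$, we
obtain

$$T_1 R_\infty = T_1 G_1 B \circ \psi - T_1 G_2^T V
= B \circ \varphi - (T_1 G_2)^T V \equiv B \circ \varphi - G_3^T V,$$

say, where

$$G_3 = T_1 G_2 = G_2 - \int\frac{G_2}{G_1}\,dG_1
= \int\Big[\frac{dG_2}{dG_1} - \frac{G_2}{G_1}\Big]\,dG_1,$$

provided the Radon-Nikodym derivative of $G_2$ w.r.t. $G_1$ exists. In the
next step we construct a linear operator $T_2$ with the following two
properties:

$$T_2 G_3 = 0 \tag{7}$$

$$T_2 B \circ \varphi = B \circ \varphi \quad \text{in distribution.} \tag{8}$$

Putting $T = T_2 \circ T_1$ we therefore get in distribution

$$T R_\infty = T_2(B \circ \varphi - G_3^T V) = B \circ \varphi,$$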


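i.e., $T R_\infty$ is a Brownian Motion w.r.t. time $\varphi$.

To define $T_2$, let $R^0_\infty = B \circ \varphi$ be a Brownian Motion
w.r.t. time $\varphi$. Also, let $G$ be a given vector-valued function.
Define the matrix

$$A(y) = \int_y^{\infty}\frac{\partial G}{\partial\varphi}(z)
\Big(\frac{\partial G}{\partial\varphi}(z)\Big)^T\varphi(dz)$$

and

$$T_2 f(x) = f(x) - \int_{-\infty}^{x}
\Big(\frac{\partial G}{\partial\varphi}\Big)^T(y)\,A^{-1}(y)
\Big[\int_y^{\infty}\frac{\partial G}{\partial\varphi}(z)\,f(dz)\Big]
\varphi(dy), \tag{9}$$

assuming that $A$ is nonsingular.


Lemma 2 We have


(i) $T_2 G = 0$
(ii) $T_2 R^0_\infty = R^0_\infty$ in distribution


Proof: (i) is trivial; as to (ii), we have for $s \le t$,

$$\begin{aligned}
\mathrm{Cov}[T_2 R^0_\infty(s), T_2 R^0_\infty(t)]
&= E[R^0_\infty(s) R^0_\infty(t)] \\
&\quad - E\Big[R^0_\infty(s)\int_{-\infty}^{t}
\Big(\frac{\partial G}{\partial\varphi}\Big)^T(y) A^{-1}(y)
\Big[\int_y^{\infty}\frac{\partial G}{\partial\varphi}(z)\,R^0_\infty(dz)\Big]
\varphi(dy)\Big] \\
&\quad - E\Big[R^0_\infty(t)\int_{-\infty}^{s}
\Big(\frac{\partial G}{\partial\varphi}\Big)^T(y) A^{-1}(y)
\Big[\int_y^{\infty}\frac{\partial G}{\partial\varphi}(z)\,R^0_\infty(dz)\Big]
\varphi(dy)\Big] \\
&\quad + E\Big[\int_{-\infty}^{s}\int_{-\infty}^{t}
\Big(\frac{\partial G}{\partial\varphi}\Big)^T(y_1) A^{-1}(y_1)
\Big[\int_{y_1}^{\infty}\frac{\partial G}{\partial\varphi}(z)\,R^0_\infty(dz)\Big] \\
&\qquad\qquad \times
\Big[\int_{y_2}^{\infty}\Big(\frac{\partial G}{\partial\varphi}\Big)^T(z)\,R^0_\infty(dz)\Big]
A^{-1}(y_2)\frac{\partial G}{\partial\varphi}(y_2)\,
\varphi(dy_2)\,\varphi(dy_1)\Big].
\end{aligned}$$

The first expectation equals $\varphi(s)$, while the second is easily seen
to be

$$\int_{-\infty}^{s}\Big(\frac{\partial G}{\partial\varphi}\Big)^T(y) A^{-1}(y)
\Big[\int_y^{s}\frac{\partial G}{\partial\varphi}(z)\,\varphi(dz)\Big]\varphi(dy).$$

Finally, the third and fourth expectations equal

$$\int_{-\infty}^{s}\Big(\frac{\partial G}{\partial\varphi}\Big)^T(y) A^{-1}(y)
\Big[\int_y^{t}\frac{\partial G}{\partial\varphi}(z)\,\varphi(dz)\Big]\varphi(dy)$$

and

$$\int_{-\infty}^{s}\int_{-\infty}^{t}
\Big(\frac{\partial G}{\partial\varphi}\Big)^T(y_1) A^{-1}(y_1)\,
A(y_1 \vee y_2)\,A^{-1}(y_2)\,\frac{\partial G}{\partial\varphi}(y_2)\,
\varphi(dy_2)\,\varphi(dy_1),$$

respectively. Summation and an application of Fubini complete the proof.
□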


Theorem 1 Assume that 


Define $T_1$ through (6) and $T_2$ through (9), with $G = G_3$. Then
$T = T_2 \circ T_1$ satisfies

$$T R_\infty = B \circ \varphi \quad \text{in distribution.}$$

Proof: Apply Lemma 1, (7) and (8). □


We now briefly discuss further issues needed before Theorem 1 can be
applied to a real data situation. Let $R_n$ be one of the processes
$\hat{\alpha}_n, \dots, \hat{\varepsilon}_n$ considered in the previous
section, or any other marked empirical process admitting a limit
$R_\infty$ as given in (4). The next step to verify is that along with

$$R_n \to R_\infty$$

one has

$$T R_n \to T R_\infty = B \circ \varphi. \tag{10}$$

Finally, observe that typically $T$ incorporates quantities which are
unknown in practice and need to be estimated from the data. Hence we come
up with a random operator $T_n$, for which it remains to show that

$$T_n R_n \to B \circ \varphi. \tag{11}$$

The proof of (10) and (11) requires some extra work and uses special
properties of the underlying processes. It is therefore beyond the scope
of the present paper. For the aforementioned examples technical details as
well as simulation results may be found in the cited papers.

We finally discuss an application of (11) which is designed to derive tests
for $H_0$ when the alternative is specified. As has been noted by Stute
(1997) in the regression case, the Radon-Nikodym derivative of the
underlying test process $R_n$ w.r.t. the hypothesis and local alternatives
may often be expressed, in the limit, in terms of the principal components
of $R_\infty$. While these are not readily available and some numerical
work is required to approximate them, the transformed processes converge
to a Brownian Motion, for which the principal components are readily
available. In other words, Theorem 1 together with (11) may be used to
yield optimal Neyman-Pearson tests for composite models when local
alternatives are specified.


Abstract: An asymptotic representation of the mean weighted integrated
squared error for the kernel estimator of the density under the Koziol- 
Green model of proportional censorship is obtained for a bootstrap re- 
sampling method. A new bandwidth selector based on the bootstrap 
is consequently proposed. Simulation results for different models using
haves appreciably better than the classical cross-validation method. Fi-


1 Introduction


A typical situation in survival analysis: let $Y$ be the variable of
interest with continuous distribution function $F$, let $C$ be the
right-censoring variable with continuous distribution function $G$, and
let $(Z, \delta)$ be the observed pair, i.e., $Z = \min(Y, C)$ and
$\delta = 1_{\{Y \le C\}}$. The general random censorship model assumes
the independence between $Y$ and $C$. Hence
$E(\delta) = \int (1 - G)\,dF$ and the (continuous) distribution function
$H$ of $Z$ satisfies $1 - H = (1 - F)(1 - G)$.


The Koziol-Green model of proportional censorship (Koziol and Green,
1976) is an interesting sub-model of the above one obtained by imposing
the additional parametric assumption

$$1 - G = (1 - F)^{\beta} \quad \text{for some } \beta > 0. \tag{1}$$


A crucial fact about (1) is that the independence between $Z$ and $\delta$
characterizes the model (Sethuraman, 1965); this allows hypothesis tests
about such a model to be constructed (see, for example, Herbst, 1992;
Henze, 1993).

Under (1) it is true that $1 - F = (1 - H)^{\theta}$, with $\theta = (1 + \beta)^{-1} = E(\delta)$.
This relation motivates the ACL (Abdushukurov, 1984; Cheng and Lin,
1984) estimator


    $1 - F_n = (1 - H_n)^{\theta_n}$,    (2)


where $H_n$ is the empirical distribution function of the $Z$'s and $\theta_n$ is the
sample mean of the $\delta$'s, given the initial sample $\{(Z_1, \delta_1), \ldots, (Z_n, \delta_n)\}$. The
estimator (2) is the maximum-likelihood estimator of the survival function
of interest under the model (1).
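
As a small illustration of (1) and (2), the following sketch simulates a proportional censorship sample and evaluates the ACL estimator. The choice of exponential lifetimes, the value $\beta = 1/3$ (which gives $\theta = 0.75$, i.e. 25% censoring) and the function names are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def simulate_koziol_green(n, beta, rng):
    """Draw (Z, delta) with Y ~ Exp(1) and C ~ Exp(rate beta), so that 1 - G = (1 - F)**beta."""
    y = rng.exponential(1.0, size=n)
    c = rng.exponential(1.0 / beta, size=n)  # NumPy parametrises by scale = 1 / rate
    return np.minimum(y, c), (y <= c).astype(float)

def acl_survival(t, z, delta):
    """ACL estimate (2) of 1 - F(t): (1 - H_n(t)) ** theta_n."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    theta_n = float(np.mean(delta))                  # sample mean of the deltas
    h_n = np.mean(z[None, :] <= t[:, None], axis=1)  # empirical d.f. of the Z's
    return (1.0 - h_n) ** theta_n

rng = np.random.default_rng(0)
z, delta = simulate_koziol_green(2000, beta=1.0 / 3.0, rng=rng)
theta_n = float(np.mean(delta))            # should be near 1 / (1 + beta) = 0.75
surv = acl_survival([0.5, 1.0], z, delta)  # to be compared with exp(-0.5) and exp(-1.0)
```

With exponential lifetimes the true survival function is $e^{-t}$, so the two values in `surv` can be checked directly against it.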

Here we are interested in the estimation of the density $f$ of $Y$ (assumed
to exist). A kernel estimator is defined in Section 2. How to choose the
bandwidth for it is our main question. Different procedures have been
considered and studied in the uncensored case; see Cao et al. (1994)
for a comparative study. Least squares cross-validation (LSCV) has been
adapted to density estimation under proportional censorship by Ghorai and
Pattanaik (1993). These authors have established asymptotic optimality,
in the sense


    $\dfrac{ISE_w(h_{CV})}{\inf_{h \in L_n} ISE_w(h)} \to 1$ a.s.,


where the cross-validation bandwidth $h_{CV}$ is the minimizer of the score
function (4) defined in Section 2, $ISE_w$ is the integrated squared error
$ISE_w(h) = \int (\hat f_h - f)^2 w$ for a suitable weight function $w$ (with the role
of eliminating endpoint effects), $\hat f_h$ is the estimator defined in (3), and the
set $L_n$ satisfies the usual regularity conditions (that can be found in the
mentioned work).

In a recent paper González-Manteiga et al. (1996) motivate the search
for improved methods of bandwidth selection. Although the quality of the
resulting density estimator is the crucial question, the rate of convergence
for cross-validation type selectors is known to be very slow. In that
work "smoothed bootstrap" ideas are the basis of a new criterion
for choosing the bandwidth in censored hazard rate estimation. Here these
considerations are translated to the context of density estimation under the
Koziol-Green model. In Section 2 we introduce a bootstrap mechanism to
select the parameter $h$ for the estimator $\hat f_h$.
We discuss some fast algorithms based on WARP ideas (see Härdle, 1991;
Fan and Marron, 1994) in Section 3, introducing methods for computing
both the estimator (3) and also the two bandwidth selectors considered
here. In Section 4 we present our simulation results for different
proportional censorship models. As in González-Manteiga et al. (1996), the
bootstrap selector behaves appreciably better than the classical
cross-validation method. We also present a real example that follows the
Koziol-Green model.


2 The estimator: LSCV and bootstrap 
bandwidth selection 


A natural kernel estimator for the true parameter of interest $f$ is given by


    $\hat f_h(y) = \int K_h(y - v)\, F_n(dv) = \sum_{i=1}^{n} K_h(y - Z_i)\, n^{-1} \theta_n (1 - H_n(Z_i))^{\theta_n - 1}$    (3)


where $K_h(\cdot) = K(\cdot/h)/h$ is the rescaled kernel function $K$ with smoothing
parameter $h$ satisfying $h \to 0$ and $nh \to \infty$ as $n \to \infty$. This estimator
was considered by Csörgő and Mielniczuk (1988). These authors proved
results on strong consistency, asymptotic normality and Bickel-Rosenblatt
type confidence bands for (3).
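
A minimal sketch of how the sum in (3) might be evaluated is given below. Two choices are assumptions of this example rather than statements of the paper: the triangular kernel is used as the default (it is the kernel reported in the simulations), and $H_n$ is evaluated just to the left of each $Z_i$, so that $1 - H_n$ is never zero and the negative power $\theta_n - 1$ stays finite at the largest observation. The function names are hypothetical.

```python
import numpy as np

def triangular_kernel(u):
    """K(u) = 1 - |u| on [-1, 1], zero elsewhere."""
    return np.clip(1.0 - np.abs(u), 0.0, None)

def kg_kernel_density(y, z, delta, h, kernel=triangular_kernel):
    """Sum in (3): sum_i K_h(y - Z_i) * n^{-1} * theta_n * (1 - H_n(Z_i))**(theta_n - 1)."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    z = np.asarray(z, dtype=float)
    n = z.size
    theta_n = float(np.mean(delta))
    ranks = np.argsort(np.argsort(z))   # 0-based ranks, so ranks / n = H_n(Z_i-)
    surv_left = 1.0 - ranks / n         # assumption: left limit, always >= 1 / n
    weights = theta_n * surv_left ** (theta_n - 1.0) / n
    k = kernel((y[:, None] - z[None, :]) / h) / h   # K_h(y - Z_i)
    return k @ weights
```

When no observation is censored, $\theta_n = 1$ and every weight equals $1/n$, so the function reduces to the ordinary kernel density estimator.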


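The LSCV bandwidth $h_{CV}$ considered here is the minimizer of


    $CV(h) = \int \hat f_h^2 w - 2 \sum_{i=1}^{n} \hat f_{h,i}(Z_i)\, n^{-1} \theta_n (1 - H_{n,i}(Z_i))^{\theta_n - 1} w(Z_i)$    (4)


where $\hat f_{h,i}$ and $H_{n,i}$ are the "leave-one-out" versions of $\hat f_h$ and $H_n$
respectively, given by


    $\hat f_{h,i}(y) = \sum_{j \ne i} K_h(y - Z_j)\, (n - 1)^{-1} \theta_n (1 - H_{n,i}(Z_j))^{\theta_n - 1}$

and

    $H_{n,i}(y) = (n - 1)^{-1} \sum_{j \ne i} 1_{\{Z_j \le y\}}$.

The function $CV(h)$ is an estimator of $MISE_w(h) - \int f^2 w$, where


    $MISE_w(h) = E \int (\hat f_h - f)^2 w$    (5)


is the mean weighted integrated squared error.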
We obtain the next result, giving an asymptotic representation of (5).
We make the following assumptions: (i) the density $f$ is continuous, (ii)
the function $w$ has compact support contained in $(0, T)$, where $T$ satisfies
$1 - H(T) > 0$, and (iii) the kernel $K$ is a square integrable probability
density function with compact support.


Theorem 1 Under the assumptions (i)-(iii):

    $MISE_w(h) = AMISE_w(h) + o((nh)^{-1})$,

where

    $AMISE_w(h) = (nh)^{-1} R(K) \int \theta (1 - H)^{\theta - 1} f w + \int (K_h * f - f)^2 w$

and $R(K) = \int K^2$ ($*$ denotes convolution).


From now on we define the bootstrap bandwidth selector $h^*$ as the
minimizer of an estimated $AMISE_w$. "Smoothed bootstraps" are required to
approximate the "bias" part of $MISE_w$. We therefore propose the
resampling plan:


1. Draw bootstrap resamples $\{Z_1^*, \ldots, Z_n^*\}$ from $H_n * K_g$.
2. Generate independent bootstrap resamples $\{\delta_1^*, \ldots, \delta_n^*\}$ from a
   Bernoulli distribution with parameter $\theta_n$.
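
The two steps of the resampling plan can be sketched as follows. A draw from $H_n * K_g$ is obtained by resampling one of the $Z_i$ and adding $g$ times a draw from the kernel $K$. The kernel is assumed here to be triangular, and the function name is hypothetical; neither is prescribed by the text at this point.

```python
import numpy as np

def smoothed_bootstrap_resample(z, delta, g, rng):
    """One resample (Z*, delta*): Z* ~ H_n * K_g, delta* ~ Bernoulli(theta_n)."""
    z = np.asarray(z, dtype=float)
    n = z.size
    theta_n = float(np.mean(delta))
    # Assumed kernel: triangular on [-1, 1]. The sum of two independent
    # Uniform(-1/2, 1/2) variables has exactly this density.
    noise = rng.uniform(-0.5, 0.5, size=n) + rng.uniform(-0.5, 0.5, size=n)
    z_star = rng.choice(z, size=n, replace=True) + g * noise   # step 1
    delta_star = (rng.random(n) < theta_n).astype(float)       # step 2
    return z_star, delta_star
```

The censoring indicators are drawn independently of the $Z^*$'s, which mirrors the independence of $Z$ and $\delta$ that characterizes model (1).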


The distribution $H_n * K_g$ denotes the one having density
$h_g(y) = \int K_g(y - v)\, H_n(dv)$. The bootstrap versions of the estimator (3)
and the error (5) are respectively


    $\hat f_h^*(y) = \sum_{i=1}^{n} K_h(y - Z_i^*)\, n^{-1} \theta_n^* (1 - H_n^*(Z_i^*))^{\theta_n^* - 1}$

and

    $MISE_w^*(h) = E^* \int (\hat f_h^* - \tilde f_g)^2 w$,

where $H_n^*$ is the empirical distribution function of the $Z^*$'s and $\theta_n^*$ is the
sample mean of the $\delta^*$'s. Similar arguments to those used in the proof of
Theorem 1 lead to:


    $AMISE_w^*(h) = (nh)^{-1} R(K) \int \theta_n (1 - H_n * K_g)^{\theta_n - 1} \tilde f_g w + \int (K_h * \tilde f_g - \tilde f_g)^2 w$    (6)

where $\tilde f_g = \theta_n (1 - H_n * K_g)^{\theta_n - 1} h_g$ is another estimator of $f$ under (1).
Then $h^*$ is the minimizer of (6). This expression shows that our bootstrap
design mimics the theoretical $AMISE_w$ in Theorem 1.


Remark 1 We opted to deal with the pilot estimator $\tilde f_g$ instead of the
theoretical bootstrap density $f_g$ in the definition of $MISE_w^*$. An analogous
expression for the asymptotic $MISE_w^*$ can be obtained using $f_g$.


3 Warping algorithms


Here we introduce fast algorithms to construct the estimator and to select
the bandwidth, both for the LSCV and the bootstrap methods. As in
Härdle (1991) and Fan and Marron (1994), the idea is to "bin" the data
into an equally spaced grid, so that the number of kernel evaluations can
be drastically reduced. Our bins are $B_z$, $z \in \mathbb{Z}$, where every
interval has length $\delta = h/M$ with $M \in \mathbb{Z}^+$. We summarize the observed data
by the counts $n_z = \sum_{i=1}^{n} 1_{B_z}(Z_i)$, $z \in \mathbb{Z}$, the number of observations in the bin
$B_z$. The large number of kernel differences $K_h(y - Z_i)$ is approximated by
the much smaller set of values $w_M(k) = K(k/M)$, for $k = 1 - M, \ldots, M - 1$,
when $K$ is supported on $[-1, 1]$. Fan and Marron (1994) showed that this
results in computational speed savings of factors up to 100.


3.1 WARPing the kernel estimator 


A WARPing approximation of $\hat f_h$ at $B_z$ is given by


    $\hat f_M(z) = (n \delta M)^{-1} \sum_{k=1-M}^{M-1} w_M(k)\, \theta_n (1 - H_M(z + k))^{\theta_n - 1} n_{z+k}$    (7)


where $1 - H_M(z) = n^{-1} \sum_{k > z} n_k$. For a fixed $h$ and letting $M \to \infty$, we
have $\hat f_M(y) \to \hat f_h(y)$ (the WARPing approximation error decreases with
the rounding error $\delta$). This point is illustrated in Figure 1.
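
The binned sum in (7) is a discrete convolution of the weighted bin counts with the kernel values $w_M(k)$, which the sketch below makes explicit. Several details are assumptions of this example only: the bins are taken as $B_z = [z\delta, (z+1)\delta)$, the result is reported at bin centres, the kernel is triangular, the normalisation is $(n\delta M)^{-1} = (nh)^{-1}$, and the tail count uses $k \ge z$ rather than $k > z$ so that the negative power $\theta_n - 1$ remains finite in the last occupied bin. The function names are hypothetical.

```python
import numpy as np

def triangular_kernel(u):
    """K(u) = 1 - |u| on [-1, 1], zero elsewhere."""
    return np.clip(1.0 - np.abs(u), 0.0, None)

def warped_density(z_obs, delta_obs, bin_width, M, kernel=triangular_kernel):
    """Binned approximation of estimator (3) with bandwidth h = bin_width * M."""
    z_obs = np.asarray(z_obs, dtype=float)
    n = z_obs.size
    theta_n = float(np.mean(delta_obs))
    idx = np.floor(z_obs / bin_width).astype(int)   # bin index of each Z_i (assumed bins)
    lo = idx.min() - M                              # pad M empty bins on each side
    counts = np.bincount(idx - lo, minlength=idx.max() + M - lo + 1).astype(float)  # n_z
    at_risk = counts[::-1].cumsum()[::-1] / n       # n^{-1} sum_{k >= z} n_k (assumption)
    at_risk = np.maximum(at_risk, 1.0 / n)          # only affects empty upper bins
    c = theta_n * at_risk ** (theta_n - 1.0) * counts
    w = kernel(np.arange(1 - M, M) / M)             # w_M(k), k = 1-M, ..., M-1
    full = np.convolve(c, w)                        # full[m] = sum_j c[j] * w[m - j]
    f_m = full[M - 1 : M - 1 + c.size] / (n * bin_width * M)
    centres = (np.arange(lo, lo + c.size) + 0.5) * bin_width
    return centres, f_m
```

Only $2M - 1$ kernel values are ever computed, whatever the sample size, which is where the speed gain of the binning comes from.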


3.2  WARPing cross-validation 


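A WARPed version of the score function (4) in Section 2 involves essentially
replacing the conventional kernel estimators by their WARPed versions,
although additional work is required to get a rapidly computable version.
We define the "leave out bin counts" $n_z^{(i)} = \sum_{j \ne i} 1_{B_z}(Z_j)$, $z \in \mathbb{Z}$, and we
define $1 - H_M^{(i)}(z) = (n - 1)^{-1} \sum_{k > z} n_k^{(i)}$.

Now we use the approximations


    $\int \hat f_h^2 w \approx \delta \sum_z \hat f_M(z)^2 w_{M,n}(z)$


and

    $\sum_{i=1}^{n} n^{-1} \hat f_{h,i}(Z_i)\, \theta_n (1 - H_{n,i}(Z_i))^{\theta_n - 1} w(Z_i)$

        $\approx \dfrac{1}{n - 1} \sum_z \hat f_M(z)\, \theta_n (1 - H_M(z))^{\theta_n - 1} n_z w_{M,n}(z)$

        $\quad - \dfrac{w_M(0)}{n(n - 1)\delta M} \sum_z n_z w_{M,n}(z) \left(\theta_n (1 - H_M(z))^{\theta_n - 1}\right)^2$,

where $w_{M,n}(z)$ denotes the value of $w$ at the lower limit of $B_z$. Our
approximated score function becomes


    $CV(M) = \delta \sum_z \hat f_M(z)^2 w_{M,n}(z) - \dfrac{2}{n - 1} \sum_z \hat f_M(z)\, \theta_n (1 - H_M(z))^{\theta_n - 1} n_z w_{M,n}(z)$

        $\quad + \dfrac{2 w_M(0)}{n(n - 1)\delta M} \sum_z n_z w_{M,n}(z) \left(\theta_n (1 - H_M(z))^{\theta_n - 1}\right)^2$    (8)


Our proposal is to fix the rounding error $\delta$ and then to minimize the
function (8) in $M$.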


Weibull (2,1) 


Figure 1: True density function (thick solid line) for a Weibull(2,1) model,
the kernel estimator (thin solid line) (computed directly on a sample of 100
uncensored observations of such a model), and two warping approximations
of this curve: $\delta = 0.1$ (dotted line) and $\delta = 0.01$ (dashed line). The role of the
rounding error becomes clear; the approximation to the kernel estimator
gets better as $\delta$ decreases to zero.


3.3 WARPing the smoothed bootstrap 


A WARPed version of the function (6) is obtained similarly by replacing the
conventional kernel estimates with their corresponding WARPed versions.
An approximation of $AMISE^*(h)$ is given by


    $AMISE^*(M) = \delta \sum_z \left( M^{-1} \sum_{k=1-M}^{M-1} w_M(k) \tilde f_{M_1}(z + k) - \tilde f_{M_1}(z) \right)^2 w_{M,n}(z)$

        $\quad + (nh)^{-1} R(K)\, \delta \sum_z \theta_n (1 - H_{M_1}(z))^{\theta_n - 1} \tilde f_{M_1}(z)\, w_{M,n}(z)$    (9)


where $\tilde f_{M_1}$ and $H_{M_1}$ are the WARPed approximations of $\tilde f_g$ and $H_n * K_g$
with parameter $M_1 = g/\delta$ at $B_z$, given respectively by


    $H_{M_1}(z) = (n M_1)^{-1} \sum_{k=1-M_1}^{M_1-1} w_{M_1}(k)\, \bar n_{z+k}$, with $\bar n_z = \sum_{j=-\infty}^{z} n_j$, and


    $\tilde f_{M_1}(z) = \theta_n (1 - H_{M_1}(z))^{\theta_n - 1} h_{M_1}(z)$, where


    $h_{M_1}(z) = (n \delta M_1)^{-1} \sum_{k=1-M_1}^{M_1-1} w_{M_1}(k)\, n_{z+k}$.


4 Simulation study. Example 


4.1 Simulation study 


In this subsection we compare LSCV and bootstrap bandwidth selectors 
for moderate sample sizes. We considered three underlying densities: 


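Weibull $(\alpha, \lambda)$. The density of interest is taken as $f = f_{\alpha,\lambda}$, satisfying
$f_{\alpha,\lambda}(x) = \alpha\lambda(\lambda x)^{\alpha - 1} e^{-(\lambda x)^{\alpha}}\, 1_{(0,\infty)}(x)$ with $\alpha, \lambda > 0$.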

Gumbel (a, A). Its density is f(z) = fo,,(x) = axe ree oes) (x) 
with a, A>0. 

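Truncated normal model. We consider the density of the random
variable $Y = X \mid X > 0$, where $X \in N(\mu, \sigma)$; that is,
$f(x) = \dfrac{\phi_{\mu,\sigma}(x)}{1 - \Phi_{\mu,\sigma}(0)}\, 1_{(0,\infty)}(x)$, where $\phi_{\mu,\sigma}(x) = \Phi_{\mu,\sigma}'(x)$ and $\Phi_{\mu,\sigma}$ is the
distribution of $X$.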


This simulation study was carried out as in González-Manteiga et al.
(1996) (see this article for details).

Table 1 contains the results of 1000 trials of sample size 100
corresponding to the following models:


- Weibull models with parameters $\lambda = 1$ and $\alpha = 1, 2, 3$ without censoring
(denoted by W(1,1), W(2,1) and W(3,1)) and also with 25% of censoring
(denoted by CW(1,1), CW(2,1) and CW(3,1)).

- Gumbel models with parameters $\lambda = 1$ and $\alpha = 1, 2, 3$ without censoring
(G(1,1), G(2,1) and G(3,1)) and with 25% of censoring (CG(1,1), CG(2,1)
and CG(3,1)).

- Truncated normal distributions with parameters $\mu = 1$ and $\sigma = 0.5$ for
an uncensored situation and also with a censoring of 25% (N(1,0.5) and
CN(1,0.5)).

Although the only sample size in the simulations presented here is n=100,
similar results were observed for n=50 and n=200. The triangular kernel
was used.


                 Mean               Std.Dev.
MODEL        h_CV     h*        h_CV     h*
W(1,1)      20.96   15.21      14.54    8.50
            21.57   18.27      13.55    9.69
W(2,1)      16.64   10.75      19.14    9.21
W(3,1)      21.62   14.39      26.47   12.32
G(1,1)      22.84   15.78      22.97   11.15
            23.46   16.36      22.44    9.72
G(2,1)      42.58   30.63      33.01   19.02
            44.12   34.59      31.60   19.39
G(3,1)      60.83   45.94      42.03   27.14
N(1,0.5)    14.82   10.41      16.78    8.81
CN(1,0.5)   19.16   13.17      25.12   10.38


Table 1: Mean and standard deviation of the integrated squared error
($L_2$-norm) of the kernel density estimator with $h_{CV}$ and $h^*$ bandwidths along
1000 trials of size 100.


The values in Table 1 are the mean and the standard deviation of
the $ISE_w$ ($L_2$-norm) $\int (\hat f_h - f)^2 w$ along the 1000 samples of size 100.
The columns headed $h_{CV}$ report the $ISE_w$ for estimates based on
cross-validation bandwidths (resulting from the minimization of $CV(M)$ in
expression (8)) and the columns headed $h^*$ report the $ISE_w$ for estimates
based on bootstrap bandwidths (resulting from the minimization of
$AMISE^*(M)$ in expression (9)). Both minimizers were taken as the global
minimizer over a fine grid. In both cases the rounding error in the WARP
approximation was $\delta = 0.01$ and the weighting function was
$w(u) = 1_{(H^{-1}(0.05),\, H^{-1}(0.95))}(u)$. Finally, the pilot bandwidth $g$ used is given
by g = [(K"}m(K) tatn], where @ is the estimator of (AY)? (hz 
the density of Z) and (K) = ,t?K(t)dt. For a sample from a normal 
density, the pilot bandwidth $g$ is asymptotically optimal for the purpose of
estimating the density curvature by the curvature of the kernel estimator
(for details see Cao et al., 1994).


                 Mean
MODEL        h_CV      h*
            146.55   128.70
CW(1,1)     135.03   117.97
            105.86    89.47
CW(2,1)      96.92    89.27
            102.39    87.54
CW(3,1)      96.80    93.93
G(1,1)      113.94   101.83
            107.99    91.20
G(2,1)      124.45   112.14
G(3,1)      128.52   117.05
CG(3,1)     120.48   106.84
N(1,0.5)    102.81    89.58
CN(1,0.5)    91.17    82.61

               Std. Dev.
             h_CV      h*
             47.86    37.88
             44.32    31.47
             49.98    38.96
             48.10    33.43


Table 2: Mean and standard deviation of the integrated absolute error
($L_1$-norm) of the kernel density estimator with $h_{CV}$ and $h^*$ bandwidths along
1000 trials of size 100.


Weibull(1,1) - Uncensored Case 


Figure 2: Kernel estimator of the densities of $h_{CV}$ (dashed line) and $h^*$
(solid line) using Gaussian kernel and smooth cross-validation bandwidth,
based on 1000 trials of size 100 of a Weibull(1,1) model (uncensored case).
The vertical lines represent the values of $h_{L_2}$ (solid line) and $h_{L_1}$ (dotted
line).


Table 2 is devoted to the $L_1$-norm $\int |\hat f_h - f|\, w$. Tables 1 and 2 show
that the $L_1$ and $L_2$ norms are more concentrated around their means for
the bootstrap selector than for the cross-validation bandwidth. The $L_1$ and
$L_2$ norms also tend to be smaller with the bootstrap bandwidth than with
cross-validation. To show the accuracy of both bandwidth selectors, Figures
2 (uncensored case) and 3 (censored case) present kernel estimates of the
densities of the two selectors $h_{CV}$ and $h^*$, using a Gaussian kernel and a
smooth cross-validation bandwidth, based on a sample of 1000 different values
from the models W(1,1) and CW(1,1), as well as some approximations of
the $h_{L_2,w}$ and $h_{L_1,w}$ bandwidths. These bandwidths were computed by
minimizing the Monte Carlo approximations, based on 1000 trials, of the
$MISE_w$, $E \int (\hat f_h - f)^2 w$, and the $MIAE_w$, $E \int |\hat f_h - f|\, w$, over a grid of
$h$ values ranging from 0.1 to 1.1 with a step of 0.05. We conclude that the
performance of $h^*$ is far superior to that of $h_{CV}$. This is what we expected,
because, as noted in Cao et al. (1994), at least in the uncensored case we
have


    $\dfrac{h_{CV} - h_0}{h_0} = O_p(n^{-1/10})$,    $\dfrac{h^* - h_0}{h_0} = O_p(n^{-5/14})$,


where $h_0$ denotes the optimum bandwidth which minimizes MISE.
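
To give a feeling for the gap between the two rates, they can be evaluated at the sample size used in the simulations. The constants hidden in the $O_p$ terms are unknown and are ignored here, so the figures are only indicative.

```latex
n = 100: \qquad n^{-1/10} = 10^{-1/5} \approx 0.63,
\qquad n^{-5/14} = 10^{-5/7} \approx 0.19 .
```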


Weibull(1,1) - 25% Censoring 


Figure 3: Kernel estimator of the densities of $h_{CV}$ (dashed line) and $h^*$
(solid line) using Gaussian kernel and smooth cross-validation bandwidth,
based on 1000 trials of size 100 of a Weibull(1,1) model (25% censoring).
The vertical lines represent the values of $h_{L_2}$ (solid line) and $h_{L_1}$ (dotted
line).


To illustrate the computational cost of the WARPing approximation
when used in the algorithms for finding $h^*$ and $h_{CV}$, the CPU times (relative
to the minimum of all of them) are summarized in Table 3. While CPU


Bootstrap selection of the smoothing parameter ... 395 


time increases slowly with the sample size, $n$, a sharp increase occurs when
the rounding error, $\delta$, gets small. We suggest the choice $\delta = 0.01$ in practice.
This seems to give a good approximation of the kernel estimator (see Figure
1) at a reasonable computational cost. (See Fan and Marron, 1994, for a
comparative study of the CPU time of the WARPing approximation and
the kernel estimator.)


$\delta = 0.005$
h_CV h* h_CV h*


1.33 | 1.00 
0.46 | 1.17 | 91.88 
0.60 


21.12 | 373.83 
1.35 | 107.68 | 20.54 | 433.19 | 107.95 


Table 3: Relative CPU times of the computations of LSCV and bootstrap 
bandwidths for samples of sizes n=50, 100 and 500 and rounding errors 
$\delta = 0.05$, 0.01 and 0.005 in the WARPing approximation.


4.2 Example: PCB-Liver data 


Between January, 1974, and May, 1984, the Mayo Clinic conducted a
double-blinded randomized trial in Primary Biliary Cirrhosis (PCB) of the
liver. A total of n=312 patients agreed to participate in this clinical trial.
The data were analyzed in 1986 for presentation in the clinical literature (see
Fleming et al., 1991). By July, 1986, 125 of the 312 patients had died,
resulting in a high proportion of censored data (60%).


GROUP I   | 163 | 32 |
GROUP III | 50 | 31 | 19 (38.00%) | 6.45 (3.9,10.7) | 463 |


Table 4: Sample size, number of deaths, percentage of censoring, Odds- 
Ratio and estimated median and mean for the four groups of patients 
considered in the PCB-Liver data example according to their prognosis 
bilirubin levels (<1.45, 1.45-3.25, 3.25-6.75 and >6.75). 


One of the most important risk factors for the survivorship of PCB is the
serum bilirubin level (Fleming et al., 1991). As in that work, we
shall consider four groups of patients, according to their prognosis bilirubin
levels: Group I (bilirubin <1.45), Group II (1.45-3.25), Group III (3.25-6.75)
and Group IV (>6.75). Descriptive statistics appear in Table 4, revealing
that the survival time decreases as the bilirubin level increases. Whereas
Group I has 80.3% of censored data, Group II has around 50.6%,
Group III 38% and Group IV only 14%. The four groups
were tested for the Koziol-Green model using the test proposed by
Henze (1993). The approximate p-values were 0.1627, 0.1552, 0.1765 and
0.46764, respectively, failing to reject the proportional censoring model for
each group. The estimator (3) of each density function can therefore be
safely used. In all cases, a Gaussian kernel was chosen. For each group,
the weighting function considered here was $w(u) = 1_{[a,b]}(u)$, where $a$ and
$b$ are, respectively, the 5% and 95% percentiles of the corresponding
observed survival time. The selected bootstrap bandwidths, minimizing their
respective functions $AMISE^*(M)$ (with rounding error $\delta = 0.1$), were $h_I^* = 3.9$,
$h_{II}^* = 4.01$, $h_{III}^* = 3.3$ and $h_{IV}^* = 1.73$. The density function estimates are
plotted together in Figure 4. Looking at this figure, one observes that the
density estimates reflect the behaviour of the survival time, revealing a
great amount of probability around the median values, in agreement with
the results presented above.


[Figure 4. Legend: bilirubin >6.75, 3.25-6.75, 1.45-3.25, <1.45. Horizontal axis: TIME (in years).]


Figure 4: Kernel density estimators for the groups in the PCB-Liver data
example: bilirubin > 6.75 (solid line), 3.25-6.75 (dashed line), 1.45-3.25
(dotted line) and <1.45 (dashed-dotted line). The figure supports the
numerical results in Table 4.


5 Conclusions 


We proposed a smoothed bootstrap selector of the smoothing parameter
in kernel density estimation when the Koziol-Green model of proportional
censorship holds. We presented a mathematical analysis supported by
simulation. It turned out that our proposal behaves convincingly better than
the cross-validation selector. We gave a fast implementation using WARPing
methods. We showed the practical interest of the introduced techniques
through the analysis of real medical data sets.


Abstract: An extension of rank regression techniques to multivariate linear
models is proposed and studied. Unlike the co-ordinatewise rank
regression techniques considered by some earlier authors, our approach 
is affine equivariant, and it is based on a transformation and retransfor- 
mation procedure originally developed by Chakraborty and Chaudhuri 
(1996, 1997) for constructing an affine equivariant version of multivari- 
ate median. Affine equivariance is expected to lead to superior statis- 
tical performance of our procedure compared to other non-equivariant 
procedures especially in the presence of substantial correlations among 
different response variables in multi-response problems. Some of the sta- 
tistical properties of the proposed multivariate rank regression estimates 
are discussed, and a few results based on numerical investigation of the 
performance of these estimates are presented. 


Key words: Affine equivariance, Hodges-Lehmann estimate, multivari- 
ate linear model, statistical efficiency, transformation retransformation 
estimates, Wilcoxon’s rank scores. 


AMS subject classification: Primary 62H12, 62J05; secondary 62F35. 


1 Introduction: multivariate linear model and 
rank regression 


The linear model is a widely used statistical tool for empirical analysis to
understand and make inferences about the nature of the inter-dependence that
exists among different variables in the data. Perhaps it will not be an
overstatement to say that various forms of the linear model pervade almost every
area of applied research where statistics can be effectively
used. Here our focus will be on multivariate linear models of the form


    $Y = \Gamma X + e$,    (1)


where $Y$ is a $d$-dimensional column vector of dependent or response
variables, $X$ is a $p$-dimensional column vector of independent or regressor
variables, and $\Gamma$ is the $d \times p$ matrix of unknown coefficient parameters that
determine how different regressor variables jointly influence different
response variables. This matrix is to be estimated from the observed data
$(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, and the term $e$ in (1) is a $d$-dimensional
column vector of random errors representing the random deviation of a 
data point from the linear model. In the special case when the response is 
real valued (i.e. when d = 1), rank regression techniques have been pro- 
posed and extensively studied as an alternative to traditional least squares 
regression by several statisticians [see e.g. Lehmann (1963a, 1963b, 1964),
Adichie (1967, 1978), Koul (1969, 1970), Puri and Sen (1969, 1973),
Jurečková (1971, 1973), Jaeckel (1972), Hettmansperger and McKean (1977, 1978,
1983)]. These authors explored various extensions of rank based methods,
which were originally developed for nonparametric inference in one and two 
sample univariate location problems, into very general linear models includ- 
ing standard ANOVA models. A primary motivation behind considering 
rank regression is the lack of robustness in least squares regression, which 
is known to have very poor performance when the random error e in the 
linear model (1) happens to follow non-Gaussian distributions especially 
those with heavy tails. Higher statistical efficiency of rank based nonpara- 
metric procedures compared to the inference based on sample means in one 
and two sample location problems involving univariate non-Gaussian data 
is known to extend for parameter estimates and related inference based 
on rank regression in linear models with univariate response. An excellent 
review of various rank based statistical methods for linear models with real 
valued response can be found in Draper (1988) and in fascinating comments 
and discussion that Draper’s expository article was successful in generating 
from leading experts in robust regression in linear models. 

Unfortunately most of the work documented in the existing literature
on rank regression is restricted to univariate response. While many
practical situations (e.g. when a medical scientist is interested in studying the
relationship between the age of an individual and his/her systolic and
diastolic blood pressures) do involve multi-response problems, very
little is available in the literature other than least squares techniques when
the response $Y$ in the linear model (1) happens to be multi-dimensional
in nature (i.e. $d > 1$). Rao (1988) and Koenker and Portnoy (1990)
considered robust estimation of parameters in multi-response linear
regression problems and suggested the use of univariate least absolute deviations
method for each co-ordinate of the response vector. Davis and McKean 
(1993) have extensively studied the co-ordinatewise extension of rank re- 
gression from univariate to multivariate response problems. These authors 
have derived some interesting statistical properties of their proposed es- 
timates and tests and reported some results on numerical performance of 
the procedures. One serious drawback of co-ordinatewise extension of rank 
regression as well as that of least absolute deviations regression is that such 
extensions fail to take into account the inter-dependence that exists among 
the real valued components of the response vector, and in practice it may 
not be appropriate to ignore the correlation present among different re- 
sponse variables. Such an approach for multivariate linear models leads to 
parameter estimates that are not equivariant and to statistical tests that 
are not invariant under general affine transformation of the data, and it is 
known that procedures that lack affine equivariance/invariance may have 
very poor statistical performance in the presence of substantial correla- 
tion among the components of the response vector. This issue has been 
discussed and investigated by Bickel (1964), Brown and Hettmansperger 
(1987, 1989) and Chakraborty and Chaudhuri (1996, 1997) in the con- 
text of multivariate location problems. It will be appropriate to note here 
that Bai, Chen, Miao and Rao (1990) proposed to estimate $\Gamma$ in the linear
model (1) by minimizing the sum $\sum_{i=1}^{n} \|Y_i - \Gamma X_i\|$ w.r.t. $\Gamma$, where for a
$d$-dimensional vector $x = (x_1, x_2, \ldots, x_d)$, $\|x\|$ = the usual Euclidean norm of
$x$ = $(x_1^2 + x_2^2 + \cdots + x_d^2)^{1/2}$, and such an estimate of $\Gamma$ can be viewed as a
generalization of the notion of spatial median [see e.g. Haldane (1948), Brown
(1983)] in linear models. While this leads to estimates that are equivari- 
ant under rotation or orthogonal transformation of the response vector, 
such estimates still lack equivariance under general affine transformation 
of the response. It is known that for multivariate data with correlated 
variables, spatial median may have poor statistical efficiency compared to 
affine invariant sample mean vector [see Brown (1983), Chaudhuri (1992a), 
Chakraborty, Chaudhuri and Oja (1997)]. Further, the lack of scale equiv- 
ariance makes spatial median as well as its generalization in linear models 
practically useless when different real valued components of the response 
vector Y are measured in different scales or when the response variables 
have different degrees of statistical variation. 

Chakraborty (1996) proposed and studied in detail an affine equivariant
extension of least absolute deviations regression in multi-response linear
model problems using a data driven transformation and retransformation
approach, which was used earlier by Chakraborty and Chaudhuri (1996,
1997) to construct an affine equivariant version of multivariate median.
Such a transformation and retransformation technique converts
nonequivariant estimates into equivariant ones and thereby improves upon the
performance of the estimates as measured by appropriate statistical efficiency
criteria. Our goal in this paper is to use the same transformation and
retransformation strategy for developing affine equivariant rank regression
techniques that can be used in the analysis of data following multivariate
linear models. Chakraborty (1996) gave a convenient algorithm (called
TREMMER) for computing the estimate of the parameter matrix $\Gamma$ in (1)
and amply demonstrated how resampling strategies like the bootstrap can
be invoked to estimate the finite sample variance-covariance matrix of such a
parameter estimate. In Section 2 that follows, we will describe how one
can appropriately modify TREMMER to come up with affine equivariant 
rank regression procedures for multi-response linear models. Such a mod- 
ification inherits the nice statistical properties of TREMMER, and in the 
special case of regression based on Wilcoxon’s rank scores or equivalently 
the linear regression analogs of Hodges-Lehmann type estimates [see e.g. 
Chaudhuri (1992b)], this modification takes a simplified and elegant form 
that makes the implementation of the methodology as well as investigation 
into its statistical properties quite convenient. In Section 3, we will present 
some results based on numerical studies that were undertaken to investigate 
the performance of the proposed methodology. We will discuss results from 
small sample simulation experiments that yield strong evidence for superior 
performance of transformation retransformation rank regression estimates 
in multi-response linear model problems when compared with traditional 
least squares and co-ordinatewise least absolute deviations estimates if the 
residuals in the linear model have elliptic non-Gaussian distributions with 
heavy tails. We will also report analysis of two real data sets each with 
bivariate response in an attempt to demonstrate the implementation of the 
methodology in real data and how it outperforms some competing non- 
equivariant procedures. Section 4 will conclude the paper with some re- 
marks on the issues that have become transparent in course of our present 
research, and there we will try to discuss briefly some of the open problems 
that require further research. 


2 The transformation and retransformation 
procedure 


Let us now focus our attention on the data points $(X_i, Y_i)$'s, which are
assumed to satisfy the linear model (1). Suppose that $n > d + p$, and $\alpha$ is
a subset of size $d + p$ of the set of indices $\{1, 2, \ldots, n\}$. Following the
notation used by Chakraborty (1996), we will write $\alpha = \{i_1, \ldots, i_p, j_1, \ldots, j_d\}$
and denote by $W(\alpha)$ the $p \times p$ matrix whose columns are the vectors
$X_{i_1}, \ldots, X_{i_p}$ and by $Z(\alpha)$ the $d \times p$ matrix whose columns are the vectors
$Y_{i_1}, \ldots, Y_{i_p}$. We will assume that $W(\alpha)$ is invertible and form the $d \times d$
matrix $E(\alpha)$ that consists of the columns $Y_{j_1} - Z(\alpha)\{W(\alpha)\}^{-1}X_{j_1}, \ldots,
Y_{j_d} - Z(\alpha)\{W(\alpha)\}^{-1}X_{j_d}$. The matrix $E(\alpha)$ too is assumed to be non-singular,
and we define the transformed response vectors as $Z_l^{(\alpha)} = \{E(\alpha)\}^{-1}Y_l$ for
$1 \le l \le n$ and $l \notin \alpha$. Suppose now that we perform rank regression on each
co-ordinate of $Z_l^{(\alpha)}$ separately with $X_l$ as the regressor as has been done in
Davis and McKean (1993), and the resulting estimate of the matrix of
coefficient parameters is denoted by $\hat\Delta^{(\alpha)}$. In other words, $\hat\Delta^{(\alpha)}$ is obtained by
minimizing (w.r.t. $\Delta$) a dispersion function $D(\Delta)$ (say), which is a simple
multivariate extension of Jaeckel's dispersion function [see Jaeckel (1972)]
based on residuals and their ranks computed from a linear model. In this
case $D(\Delta)$ is a function of the real valued co-ordinates of the multivariate
residuals $Z_l^{(\alpha)} - \Delta X_l$ with $1 \le l \le n$, $l \notin \alpha$ and their ranks [see Davis and
McKean (1993)]. Finally, the transformation retransformation estimate of
$\Gamma$ is obtained by retransforming $\hat\Delta^{(\alpha)}$ by the matrix $E(\alpha)$ as follows


    $\hat\Gamma^{(\alpha)} = E(\alpha)\,\hat\Delta^{(\alpha)}$ .    (2)


In view of the definition of $\hat\Gamma^{(\alpha)}$ we now have the following result, which
asserts that it is an affine equivariant estimate of $\Gamma$. As a matter of fact,
this result is the analog of Chakraborty's (1996) Theorem 2.1 in the context
of rank regression.


Result 1 Suppose that $A$ is a fixed $d \times d$ non-singular matrix. Then the
transformation retransformation estimate computed from the transformed
data points $(X_1, AY_1), (X_2, AY_2), \ldots, (X_n, AY_n)$ in the same way as above
(i.e. using the same index set $\alpha$) will be $A\hat\Gamma^{(\alpha)}$. Further, if the response
vector $Y_i$ is transformed to $Y_i - GX_i$ for each $i = 1, 2, \ldots, n$, where $G$
is a fixed $d \times p$ matrix, the transformation retransformation estimate gets
transformed to $\hat\Gamma^{(\alpha)} - G$.


Proof: In view of the construction of $Z(\alpha)$, when the $Y_i$'s are transformed
to $AY_i$'s, $Z(\alpha)$ gets transformed to $AZ(\alpha)$, and consequently the
transformation matrix $E(\alpha)$ is transformed to $AE(\alpha)$. On the other hand,
since the $Z_l^{(\alpha)}$'s remain invariant under non-singular linear transformation
of the $Y_i$'s, so does the estimate $\hat{\Lambda}_n^{(\alpha)}$, which is obtained by performing
co-ordinatewise rank regression of the $Z_l^{(\alpha)}$'s on the regressor vectors $X_l$'s.
Hence, the transformation retransformation estimate will be transformed
to $A\hat{\Gamma}_n^{(\alpha)}$ as it has been claimed in the statement of the result. Next note
that when each of the $Y_i$'s is transformed as $Y_i - GX_i$, the matrix $Z(\alpha)$
becomes $Z(\alpha) - GW(\alpha)$. However, the matrix $E(\alpha)$ remains unaltered
after such a transformation of the response. Since each of the $Z_l^{(\alpha)}$'s gets
transformed as $Z_l^{(\alpha)} - \{E(\alpha)\}^{-1}GX_l$ in this case, the equivariance of
co-ordinatewise rank regression estimate implies that $\hat{\Lambda}_n^{(\alpha)}$ will now become
$\hat{\Lambda}_n^{(\alpha)} - \{E(\alpha)\}^{-1}G$. The proof is now complete in view of the way
$\hat{\Gamma}_n^{(\alpha)}$ has been formed by retransforming $\hat{\Lambda}_n^{(\alpha)}$ into $E(\alpha)\hat{\Lambda}_n^{(\alpha)}$. □
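
The construction and the two equivariance properties in Result 1 can be checked numerically. The following is only a minimal sketch under simplifying assumptions that are not part of the paper: a single regressor without intercept ($p = 1$), and a co-ordinatewise least absolute deviations fit through the origin (computed as a weighted median) used as a stand-in for the co-ordinatewise rank regression. The function names `weighted_median`, `lad_slope` and `tr_estimate` are hypothetical.

```python
import numpy as np

def weighted_median(values, weights):
    """Return a minimiser of sum_i weights[i] * |values[i] - m| over m."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def lad_slope(x, z):
    """Stand-in co-ordinatewise fit: minimise sum_l |z_l - lam * x_l| (p = 1)."""
    keep = x != 0
    return weighted_median(z[keep] / x[keep], np.abs(x[keep]))

def tr_estimate(x, Y, alpha):
    """x: (n,) regressors, Y: (n, d) responses, alpha = [i_1, j_1, ..., j_d]."""
    i1, js = alpha[0], alpha[1:]
    # With p = 1, W(alpha) is the scalar x[i1] and Z(alpha) is the column Y[i1].
    E = np.column_stack([Y[j] - Y[i1] * (x[j] / x[i1]) for j in js])
    rest = np.setdiff1d(np.arange(len(x)), alpha)       # indices l not in alpha
    Z = np.linalg.solve(E, Y[rest].T).T                 # rows are E^{-1} Y_l
    lam = np.array([lad_slope(x[rest], Z[:, k]) for k in range(Y.shape[1])])
    return E @ lam                                      # retransformation, eq. (2)

rng = np.random.default_rng(0)
n, d = 40, 2
x = rng.normal(size=n)
Y = np.outer(x, [1.0, -2.0]) + rng.normal(size=(n, d))
alpha = [0, 1, 2]
A = np.array([[2.0, 1.0], [0.5, 3.0]])
G = np.array([0.7, -0.3])

gamma_hat = tr_estimate(x, Y, alpha)
gamma_lin = tr_estimate(x, Y @ A.T, alpha)               # responses A Y_i
gamma_shift = tr_estimate(x, Y - np.outer(x, G), alpha)  # responses Y_i - G X_i
```

Up to rounding error, `gamma_lin` should agree with `A @ gamma_hat` and `gamma_shift` with `gamma_hat - G`, mirroring the two statements of the result.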


2.1 Selection of the optimal data driven transformation 


Since the estimate $\hat{\Gamma}_n^{(\alpha)}$ depends on the choice of the transformation matrix
$E(\alpha)$, a question that naturally arises at this point is how to choose
the subset of indices a. This question has been dealt with in other situ- 
ations by Chakraborty and Chaudhuri (1996, 1997), Chakraborty (1996) 
and Chakraborty, Chaudhuri and Oja (1997), who used transformation and 
retransformation techniques in different multivariate estimation problems. 
Depending on the nature of the problem, these authors have determined 
the form of the optimal transformation $E(\alpha)$ and suggested appropriate
data driven selection procedures for the optimal subset of indices $\alpha$. All
these procedures for choosing the optimal transformation matrix, however, 
are based on the common idea of minimizing the generalized variance (i.e. 
the determinant of the dispersion matrix) of the multivariate location or 
regression estimate. The motivation for looking at the generalized variance 
comes from the fact that it is proportional to the volume of the concen- 
tration ellipsoid associated with the sampling distribution of the estimate 
which is usually normal for large samples. We will now state a result that
asserts that under suitable regularity conditions $\hat{\Gamma}_n^{(\alpha)}$ is an $n^{1/2}$-consistent
and asymptotically normal estimate of the parameter matrix $\Gamma$ in the linear
model (1).


Result 2 Fix an $\alpha$. Suppose that the distribution of the $(X_i, Y_i)$'s and
the nature of the dispersion function $D(\Lambda)$ are such that $n^{1/2}$-consistency
and asymptotic normality of the co-ordinatewise rank regression estimates
hold. For example, the regularity conditions used in Davis and McKean
(1993), who considered co-ordinatewise rank regression, will be sufficient
for this purpose. Then conditioned on $\alpha$ and the $(X_i, Y_i)$'s with $i \in \alpha$, the
asymptotic distribution of $n^{1/2}(\hat{\Gamma}_n^{(\alpha)} - \Gamma)$ is multivariate normal with zero
mean and a variance covariance matrix that depends on the transformation
matrix $E(\alpha)$.

Proof: Let us fix an $\alpha$ and argue conditionally given the $(X_i, Y_i)$'s with
$i \in \alpha$. Note that since $\hat{\Lambda}_n^{(\alpha)}$ is obtained by performing co-ordinatewise
rank regression of the transformed response vectors $Z_l^{(\alpha)}$'s on the covariates
$X_l$'s, it will be an $n^{1/2}$-consistent and asymptotically normal estimate
of $\{E(\alpha)\}^{-1}\Gamma$ under appropriate regularity conditions as assumed in the
statement of the result. The proof is now complete if we recall that
$\hat{\Gamma}_n^{(\alpha)} = E(\alpha)\hat{\Lambda}_n^{(\alpha)}$ and use the fact that linear transformation preserves
multivariate normality as well as $n^{1/2}$-consistency. □

However, the conditional asymptotic dispersion matrix of $\hat{\Gamma}_n^{(\alpha)}$ depends
on $E(\alpha)$ in a rather complex way, and it is hardly useful in providing any
insight regarding the optimal choice of $\alpha$ in a general situation. Alternatively,
one can try to use resampling techniques (e.g. the bootstrap) to
estimate the sampling variation in $\hat{\Gamma}_n^{(\alpha)}$ and then select an optimal $E(\alpha)$
based on this estimate. However, it does not seem to be a feasible approach
in practice in view of the enormous amount of computation that any form
of resampling estimation of the dispersion of $\hat{\Gamma}_n^{(\alpha)}$ will require for different
choices of $\alpha$.

Suppose now that $e$ has an elliptically symmetric distribution with a
density of the form $\{\det(\Sigma)\}^{-1/2} f(e^T\Sigma^{-1}e)$, where $\Sigma$ is a $d \times d$ positive
definite matrix, and $f$ is a probability density function on the real line. Let
us write $\{\Sigma^{-1/2}E(\alpha)\}^{-1} = R(\alpha)J(\alpha)$, where $R(\alpha)$ is a diagonal matrix with
positive diagonal entries, and $J(\alpha)$ is a matrix whose rows are of unit length,
and define $D(\alpha)$ to be the symmetric $d \times d$ matrix whose $(i, j)$-th element
is $\sin^{-1}\gamma_{ij}$, $\gamma_{ij}$ being the Euclidean inner product of the $i$-th and the $j$-th
row of $J(\alpha)$. Then it was proved by Chakraborty (1996) under suitable
conditions that the asymptotic generalized variance of the transformation
retransformation median regression (i.e. TREMMER) estimate of $\Gamma$ in the
linear model (1) is minimized by choosing $\alpha$ to minimize the determinant
of the matrix

$$V(\alpha) = \{J(\alpha)\}^{-1} D(\alpha) \{[J(\alpha)]^T\}^{-1}. \qquad (3)$$


Note that such a selection of $\alpha$ does not require any knowledge of the form
of the density $f$, and there is a nice and intuitively appealing geometric
interpretation for such an approach. The determinant of $V(\alpha)$ is minimized
when the columns of $\Sigma^{-1/2}E(\alpha)$ are orthogonal to one another. Hence,
an alternative way of selecting $E(\alpha)$ to achieve a similar goal will be to
minimize the ratio of the trace and the $d$-th root of the determinant of
the matrix $\{E(\alpha)\}^T\Sigma^{-1}E(\alpha)$, which is equivalent to minimizing the ratio
of the arithmetic mean and the geometric mean of the eigenvalues of this
positive definite matrix. In the absence of any other better and practically
feasible procedure, we intend to use this criterion for choosing the
transformation matrix for our multivariate rank regression. In other words, our
recommendation amounts to transforming the response vectors using a new
data driven co-ordinate system determined by the transformation matrix
$E(\alpha)$ such that the co-ordinate system is as orthogonal as possible in the
$d$-dimensional vector space, where the inner product and orthogonality are
defined based on the positive definite dispersion matrix $\Sigma$ of the residual
distribution associated with the linear model (1) [see also Chakraborty and
Chaudhuri (1996, 1997) and Chakraborty, Chaudhuri and Oja (1997)]. Of
course we need an appropriate estimate of $\Sigma$ in order to implement such
a strategy, and we can get that from the residuals computed at an initial
stage after fitting the linear model to the data by any simple and suitable
method. Note that it is important that such an estimate of $\Sigma$ be equivariant
under linear transformation of the response vectors.
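
The trace-to-determinant criterion is straightforward to evaluate for any candidate matrix $E(\alpha)$. The sketch below is illustrative only: `selection_criterion` and `choose_alpha` are hypothetical names, the two candidate matrices are made up, and the way candidate index sets are enumerated is left open, as it is in the text. By the arithmetic-geometric mean inequality the criterion is at least $d$, with equality exactly when $\{E(\alpha)\}^T\Sigma^{-1}E(\alpha)$ is a multiple of the identity.

```python
import numpy as np

def selection_criterion(E, Sigma):
    """trace / det^(1/d) of E^T Sigma^{-1} E; smaller means 'more orthogonal'."""
    M = E.T @ np.linalg.solve(Sigma, E)
    d = M.shape[0]
    return np.trace(M) / np.linalg.det(M) ** (1.0 / d)

def choose_alpha(candidates, build_E, Sigma_hat):
    """Return the candidate index set with the smallest criterion value."""
    return min(candidates, key=lambda a: selection_criterion(build_E(a), Sigma_hat))

# Illustrative dispersion matrix with strongly correlated co-ordinates.
Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
vals, vecs = np.linalg.eigh(Sigma)
E_orth = vecs * np.sqrt(vals)        # Sigma^{-1/2} E_orth has orthonormal columns
E_skew = np.array([[1.0, 1.0], [0.0, 0.2]])

# Hypothetical lookup standing in for the construction of E(alpha) from data.
lookup = {"orth": E_orth, "skew": E_skew}
best = choose_alpha(["orth", "skew"], lookup.get, Sigma)
```

Here the first candidate attains the lower bound $d = 2$, so it is the one selected.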


2.2 Multivariate rank regression using Wilcoxon’s score 


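Let us now consider the dispersion functions associated with the well known
Wilcoxon's rank scores. Such dispersion functions can be expressed in the
form

$$D(\Lambda) = \sum_{1 \le r < s \le n;\ r, s \notin \alpha} \left| (Z_r^{(\alpha)} + Z_s^{(\alpha)}) - \Lambda(X_r + X_s) \right| \qquad (4)$$

or

$$D(\Lambda) = \sum_{1 \le r < s \le n;\ r, s \notin \alpha} \left| (Z_r^{(\alpha)} - Z_s^{(\alpha)}) - \Lambda(X_r - X_s) \right| , \qquad (5)$$

where for a $d$-dimensional vector $x = (x_1, x_2, \ldots, x_d)$, $|x|$ = the $l_1$-norm
of $x$ = $|x_1| + |x_2| + \ldots + |x_d|$. Note that the dispersion in (4) originates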
from Wilcoxon’s signed rank score used in single sample location problems 
while that in (5) is related to the two sample Wilcoxon’s rank test. The 
second dispersion can also be viewed as a form of Gini’s mean difference 
of multivariate residuals, and it is meaningful to use this dispersion func- 
tion when there is no intercept term present in the linear model (1). On 
the other hand the dispersion function in (4) is useful in multivariate lin- 
ear models with intercept terms. Readers are referred to Aubuchon and 
Hettmansperger (1989) and Chaudhuri (1992b) for a detailed discussion of 
these dispersion functions and their use in rank regression in linear models 
with univariate response. 

The estimates of the coefficient matrix obtained through minimization 
of dispersion functions in (4) and (5) can be viewed as natural extensions 
of the well-known Hodges-Lehmann estimates from one and two sample 
location problems into multivariate linear models. Observe at this point
that minimization of any of these two dispersions leads to a co-ordinatewise
least absolute deviations problem, and hence the computation of the
transformation retransformation estimate $\hat{\Gamma}_n^{(\alpha)}$ can be easily handled by some
straightforward modification of the TREMMER algorithm developed by
Chakraborty (1996). One only needs to replace the original data by their
pairwise averages or differences (depending on whether (4) or (5) is used)
before invoking TREMMER. We now state a result that establishes asymptotic
optimality of our procedure for choosing the transformation matrix
$E(\alpha)$ as described in Section 2.1 when rank regression is performed using
Wilcoxon's score in a multivariate linear model with the residual having
multivariate normal distribution.
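
The reduction to a least absolute deviations problem on pairwise differences can be illustrated for a single co-ordinate. The following is a minimal sketch, not the TREMMER algorithm itself: it assumes one scalar regressor ($p = 1$), in which case minimizing the dispersion (5) over the slope is a weighted median of the pairwise slopes. The names `wilcoxon_slope` and `dispersion` are hypothetical.

```python
import numpy as np

def weighted_median(values, weights):
    """Return a minimiser of sum_i weights[i] * |values[i] - m| over m."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def dispersion(lam, x, z):
    """One co-ordinate of (5): sum over r < s of |(z_r - z_s) - lam (x_r - x_s)|."""
    r, s = np.triu_indices(len(x), k=1)
    return np.abs((z[r] - z[s]) - lam * (x[r] - x[s])).sum()

def wilcoxon_slope(x, z):
    """LAD fit on pairwise differences, i.e. a weighted median of pairwise slopes."""
    r, s = np.triu_indices(len(x), k=1)
    dx, dz = x[r] - x[s], z[r] - z[s]
    keep = dx != 0
    return weighted_median(dz[keep] / dx[keep], np.abs(dx[keep]))

rng = np.random.default_rng(1)
x = rng.normal(size=25)
z = 1.5 * x + rng.standard_t(df=3, size=25)   # heavy-tailed residuals
lam_hat = wilcoxon_slope(x, z)
```

Since the dispersion is convex and piecewise linear in the slope, `lam_hat` should not be improved upon by any nearby value; for the dispersion (4) the same device applies with pairwise sums in place of differences.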


Result 3 Suppose that the residuals $e_i = Y_i - \Gamma X_i$ for $1 \le i \le n$ are
i.i.d. and have a common $d$-variate normal distribution with zero mean and
$\Sigma$ as their common dispersion matrix that does not depend on the regressor
(i.e. we have perfect homoscedasticity), and the i.i.d. random regressors
$X_i$'s have a distribution with an associated $p \times p$ expected information matrix
$E(X_iX_i^T) = Q$ that is positive definite, ensuring asymptotic normality
of the co-ordinatewise rank regression estimates obtained using the dispersion
function (4) or (5) [cf. the asymptotic results in Chaudhuri (1992b)].
Then our procedure for choosing the set of indices $\alpha$ and the associated
transformation matrix $E(\alpha)$ described in Section 2.1 yields a transformation
retransformation estimate $\hat{\Gamma}_n^{(\alpha)}$ such that the asymptotic generalized
variance of $n^{1/2}(\hat{\Gamma}_n^{(\alpha)} - \Gamma)$ tends to its minimum possible value as $n$ tends
to infinity.


Proof: Once again let us fix $\alpha$ and argue conditionally given the $(X_i, Y_i)$'s
with $i \in \alpha$. Note that when the dispersion function (5) is used, there
are no intercept terms in the multivariate linear model, and without loss
of generality we can assume in this case that the $X_i$'s have zero mean.
Under the conditions assumed in the statement of the result, it is easy to
establish a Bahadur type asymptotic linear representation of $\hat{\Gamma}_n^{(\alpha)}$ using the
asymptotic results in Chaudhuri (1992b), and this implies that as $n$ tends
to infinity, the limiting distribution of $n^{1/2}(\hat{\Gamma}_n^{(\alpha)} - \Gamma)$ is multivariate normal
with zero mean and a variance covariance matrix that has the form

$$\{J(\alpha)\}^{-1} H(\alpha) \{[J(\alpha)]^T\}^{-1} \otimes Q^{-1} , \qquad (6)$$

where $\otimes$ denotes the usual Kronecker product of matrices. Here $J(\alpha)$ is
the matrix whose rows are obtained by normalizing the rows of the matrix
$\{\Sigma^{-1/2}E(\alpha)\}^{-1}$ as described in Section 2.1, and $H(\alpha)$ is the $d \times d$ symmetric
matrix with $(i, j)$-th element equal to $2\sin^{-1}(\gamma_{ij}/2)$, $\gamma_{ij}$ being the
Euclidean inner product between the $i$-th and the $j$-th row of $J(\alpha)$. It is
the multivariate normality of the residual distribution in the linear model
that enables us to simplify the form of the asymptotic dispersion matrix
in this special case. It is clear from (6) that the asymptotic generalized
variance of the transformation retransformation rank regression estimate
will be minimized if we choose $\alpha$ to minimize $\det\{H(\alpha)\}/[\det\{J(\alpha)\}]^2$, and
this is accomplished when the rows of $J(\alpha)$ or equivalently the columns of
$\Sigma^{-1/2}E(\alpha)$ are orthogonal to one another. □

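The quantity $\det\{H(\alpha)\}/[\det\{J(\alpha)\}]^2$ that drives the generalized variance in (6) can be computed directly from $E(\alpha)$ and $\Sigma$. The snippet below is a small numerical illustration under assumed inputs (the matrices are made up and `gv_factor` is a hypothetical name). When the columns of $\Sigma^{-1/2}E(\alpha)$ are orthogonal, $J(\alpha)$ has orthonormal rows and $H(\alpha) = (\pi/3)I$, so the factor equals $(\pi/3)^d$.

```python
import numpy as np

def inv_sqrt(Sigma):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(Sigma)
    return (vecs / np.sqrt(vals)) @ vecs.T

def gv_factor(E, Sigma):
    """det{H(alpha)} / [det{J(alpha)}]^2 for a given E(alpha) and Sigma."""
    M = np.linalg.inv(inv_sqrt(Sigma) @ E)
    J = M / np.linalg.norm(M, axis=1, keepdims=True)   # rows of unit length
    gamma = np.clip(J @ J.T, -1.0, 1.0)                # inner products of the rows
    H = 2.0 * np.arcsin(gamma / 2.0)
    return np.linalg.det(H) / np.linalg.det(J) ** 2

I2 = np.eye(2)
E_skew = np.array([[1.0, 0.5], [0.0, 1.0]])            # non-orthogonal columns

Sigma = np.array([[1.0, 0.9], [0.9, 1.0]])
vals, vecs = np.linalg.eigh(Sigma)
E_orth = vecs * np.sqrt(vals)                          # orthogonal in the Sigma metric

factor_orth = gv_factor(E_orth, Sigma)
factor_skew = gv_factor(E_skew, I2)
```

In this example `factor_orth` equals $(\pi/3)^2$, while the skewed choice gives a strictly larger value.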

3 Numerical results: simulation and data analysis


In an attempt to investigate the performance of transformation retransformation
rank regression methodology in finite sample situations, we ran a
simulation study and analyzed a couple of real data sets for which there are 
some appropriate multi-response linear models. We compared our approach 
with more traditional procedures some of which are not affine equivariant, 
and as we will gradually see the results turned out to be quite encouraging 
and favorable for our affine equivariant rank regression. 

A Simulation Study: We considered a problem with sample size
$n = 30$, where the data was generated from a multivariate linear model
like (1) with $d = p = 2$, and the first co-ordinate of $X$ was taken to be the
constant 1.0 while the second co-ordinate was generated from a standard
normal distribution. We chose $\Gamma$ as the $2 \times 2$ zero matrix, and for the random
residual, we used three different elliptically symmetric distributions,
i.e. distributions having densities of the form $\{\det(\Sigma)\}^{-1/2} f(e^T\Sigma^{-1}e)$.
These distributions are bivariate normal, bivariate Laplace [i.e. when
$f(e^Te) = (2\pi)^{-1}\exp(-\sqrt{e^Te})$] and bivariate $t$ with 3 degrees of freedom.
We used the dispersion function (4) for computing the transformation
retransformation estimate $\hat{\Gamma}_n^{(\alpha)}$ after choosing $\alpha$ using the selection procedure
described in Section 2.1. Let $E_{ols}$ and $E_{lad}$ denote the efficiencies of our
estimates compared with the ordinary least squares and co-ordinatewise least
absolute deviations estimates respectively. These efficiencies were computed
using the fourth root of the ratio of the generalized variances of
competing estimates [see e.g. Bickel (1964)], and the generalized variances
were estimated using 3000 Monte Carlo replications in each case. Since
both the ordinary least squares estimate and our estimate of $\Gamma$ are affine
equivariant, $E_{ols}$ does not depend on $\Sigma$. We observed that for bivariate
normal $E_{ols}$ = 82%, and for bivariate Laplace $E_{ols}$ = 101%. However, for
the $t$ distribution with 3 degrees of freedom, which is a distribution with a
fairly heavy tail, we observed that $E_{ols}$ = 150%. Since the co-ordinatewise
least absolute deviations regression does not lead to an affine equivariant
estimate of $\Gamma$, $E_{lad}$ depends on $\Sigma$. For our simulation study, we have used
different choices of $\Sigma$, and each choice had both diagonal entries equal to
1.0 and both off diagonal entries equal to $\rho$. Five different values of $\rho$
were used, and they are 0.75, 0.80, 0.85, 0.90 and 0.95. The results are
summarized in the following table.


Table 3.1: Values of $E_{lad}$ for different
choices of the residual distribution and $\rho$.


Residual       Values of $\rho$
Distribution   0.75 | 0.80 | 0.85 | 0.90 | 0.95

Analysis of Blood Pressure Data: This data was collected by the 
Biological Sciences Division of Indian Statistical Institute, Calcutta, and 
it consists of systolic and diastolic blood pressures of 40 Marwari females 
residing at Burrabazar area of Calcutta and their ages. It is well known 
to physiologists that arterial pressure increases with age, and age is con- 
sidered to be a factor of prime importance in deciding what should be the 
normal arterial pressure of an individual. As one would expect, there is 
ample evidence in the data [see e.g. Chakraborty (1996) who analyzed the 
same data using TREMMER] for the presence of high positive correlation
between systolic and diastolic pressures, and hence one can argue in
favor of using an affine equivariant procedure, which is expected to be sta- 
tistically more efficient for analyzing this data set than a non-equivariant 
procedure such as the co-ordinatewise least absolute deviations regression. 
We applied our affine equivariant rank regression procedure based on the 
dispersion function (4) to this data and obtained the following estimated 
linear equations : systolic pressure = 100.64 + 0.8(age), and diastolic pres- 
sure = 74.04 + 0.32(age). Following Chakraborty (1996), we estimated the 
sampling variations using 2000 bootstrap samples for each of the compet- 
ing procedures and observed 66.9% gain in statistical efficiency when our 
affine equivariant rank regression was compared with co-ordinatewise least 
absolute deviations regression. The coefficients of age in both the equations 
here are slightly larger than those obtained by Chakraborty (1996) using 
TREMMER, and their standard errors (0.20 and 0.11 for systolic and dias- 
tolic pressures respectively) estimated through bootstrap turned out to be 
smaller than those for the TREMMER estimates [cf. Chakraborty (1996)].
It will be appropriate to note here that for using TREMMER, Chakraborty
(1996) reported 14.5% gain in statistical efficiency over non-equivariant
co-ordinatewise least absolute deviations.

Analysis of Demographic Data: This second data set consists of to- 
tal fertility rates (TFR), infant mortality rates (IMR) and female literacy 
rates (FLR) for the years 1971, 1981 and 1991 for sixteen most populated 
states in India. The data is given in a nicely compiled form in Srinivasan 
(1995). TFR is defined as the number of children born to a woman in her 
entire reproductive span assuming that she experiences the level of age- 
specific fertility rates in a given period of time, and IMR is the number 
of deaths of infants (i.e. children below age of one year) per thousand 
live births during a given period. Socio-demographic studies have strongly 
revealed education of women as a major determinant of visible decline in in- 
fant mortality and total fertility levels of the population. Our main interest 
here is in exploring the nature of dependence of TFR and IMR on FLR as 
well as the changes in TFR and IMR over time and their regional variations, 
and for this we have used a multivariate analysis of covariance type model 
with four regional effect parameters (corresponding to northern, southern, 
eastern and western regions of the country) and two covariates, namely 
FLR and time. Once again in view of strong correlation between TFR and 
IMR [see Chakraborty (1996)], any non-equivariant estimation procedure 
is expected to perform poorly in this case. Since here one is interested 
in the differences between regional effects, the dispersion function in (5) 
is quite appropriate. When we compared our affine equivariant procedure
with non-equivariant co-ordinatewise rank regression based on Wilcoxon's
score using bootstrap estimates of sampling variations, we observed about
8% gain in statistical efficiency. As in the preceding example, here too we
used 2000 bootstrap samples for each competing procedure. In the case of 
our affine equivariant procedure, time with estimated coefficients -0.4929 
and -9.5899 having standard errors 0.1964 and 4.5880 respectively appeared 
to be a statistically significant covariate indicating decline in both of TFR 
and IMR over time. FLR too turned out to be a statistically significant 
covariate with estimated coefficients -0.03775 and -1.3006 having standard 
errors 0.01223 and 0.2983 respectively indicating a strong influence of fe- 
male education on decreasing TFR and IMR. However, our analysis did not 
reveal any statistically significant regional difference in fertility and mor- 
tality rates. These findings are in conformity with the results reported in 
Chakraborty (1996) who analyzed the same data using TREMMER. 


4 Concluding remarks and discussion 


An important issue that emerges at this point is that the problem of
multivariate rank regression is intrinsically related to the problem of multivariate
quantiles and ranks. Readers are referred to Chaudhuri (1996) and Möttönen
and Oja (1995) for a detailed review of various notions of multivariate
quantiles and ranks. A particularly interesting alternative to our present
approach of multivariate rank regression will be to use the ranks associated
with spatial or geometric quantiles [see Chaudhuri (1996), Möttönen and
Oja (1995)]. Affine equivariance can still be achieved through data driven
transformation and retransformation as has been done in Chakraborty, 
Chaudhuri and Oja (1997), where an equivariant modification of spatial me- 
dian and an invariant modification of angle test were proposed and studied 
based on the idea of transformation and retransformation. However, such 
a geometric concept of quantiles leads to vector valued ranks that are very 
different in nature from co-ordinatewise ranks, and one needs to redefine 
the multivariate analog of Jaeckel’s dispersion function appropriately using 
some suitable notion of score functions defined for vector valued ranks. 

So far we are able to prove asymptotic optimality of our procedure for 
selecting the subset of indices $\alpha$ and the associated transformation matrix
$E(\alpha)$ only in a very special case, i.e. for dispersions associated with
Wilcoxon's rank scores, and when the residual in the linear model (1) is normally
distributed. In the case of multivariate median (or least absolute devia- 
tions) regression, Chakraborty (1996) was able to show that the proposed 
data based selection rule for choosing the transformation matrix leads to 
an asymptotically optimal solution whenever the residual distribution is 
elliptically symmetric. The nice geometric interpretation of this selection 
procedure described in Section 2.1 makes us believe that its asymptotic op- 
timality holds under much weaker and more general conditions than what 
we have assumed in Result 3. 

Rank regression in linear models with univariate response generated fas- 
cinating research problems and innovative statistical tools for nearly three 
decades [see Draper (1988)]. This enriched our theoretical understanding of 
linear model analysis and enabled us to invent new methodology for explor- 
ing relationships present among different variables in the data. Multivariate 
rank regression is likely to lead us to a more fertile ground for methodolog- 
ical and theoretical research. As multi-response problems do arise often in 
practice, there seems to be a real need for a serious and extensive research 
of rank regression to deal with such problems.

Abstract: In high dimension, the estimation of a density is difficult
because the observed data gets increasingly sparse with the dimension. This
is known as the curse of dimensionality. For that reason, in high dimension,
universally consistent estimators such as the kernel density estimator
are not practical. In this paper, we consider a class of multivariate
densities, within which a density function $f$ can be expressed as $f = g \circ D$
for some given notion of data depth $D$ and some real function $g$. We propose
a density estimator which is shown to be consistent within the class,
and it converges at the same rate as the univariate kernel density estimator.

1 Introduction


Let $X_1, \ldots, X_n$ be an i.i.d. sample from an unknown density $f : \mathbb{R}^p \to
[0, \infty)$. When $p$ is large, a kernel density estimator of the form

$$\hat{f}(x; h) = \frac{1}{nh^p} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),$$

where $h$ is the smoothing parameter, is impractical because of the extremely
large number of observations needed to “fill” the $p$-dimensional space in
order to ensure sufficient observations in each “bin” of the kernel. This
phenomenon of sparsity of data in high dimensional space is referred to as
the “curse of dimensionality” in Bellman (1961).

Many approaches have been developed in the literature in order to address
this problem, and in particular projection pursuit techniques (Friedman,
Stuetzle and Schroeder, 1984). Projection pursuit avoids this problem
by working in low-dimensional linear projections. However, as pointed out by
Huber (1985), projection pursuit is poorly suited to deal with highly
linear structures. For these reasons, the analysis of high-dimensional data 
sets is often made under some additional restrictions. One common prac- 
tice is to assume that the density f belongs to some parametric family, so 
the estimation of the density amounts to the estimation of finitely many 
parameters. For example, if the underlying density is normal, then one only 
needs to estimate the mean and the variance. In the same spirit but without 
the firm grip of parametric assumptions, the so-called “tailor-design density 
estimates” (cf. Devroye, 1987) are designed to perform well for a particu- 
lar class of densities. This class can but does not have to be parametric. 
In general “tailor-design density estimates” are not universally consistent, 
since they are tailored to suit a specific target class of densities. A typical 
example is the Grenander estimator (Grenander, 1956) which concerns only 
monotone densities. 

In this paper we rely on the general nonparametric smoothing principle 
to provide a multivariate density estimator, with the idea of enlarging the 
neighbourhood for smoothing so as to include sufficiently many data points 
even when the dimension is high. Roughly speaking, our approach may 
be viewed as a generalized version of the following simple nonparametric 
density estimator 


$$\hat{f}(x; A_h) = \frac{1}{n} \sum_{i=1}^{n} \frac{1\{X_i \in A_h(x)\}}{m_p(A_h(x))},$$

where $1$ denotes the indicator function, $m_p$ denotes the $p$-dimensional
Lebesgue measure and $A_h(x) = \{t : \|x - t\| < h\}$. The above estimate
takes advantage of the smoothness of the unknown density $f$, and assumes
that $f$ is nearly constant in the neighbourhood $A_h(x)$. As indicated above,
the difficulty with this approach in $p$ dimensions is that the volume of the
neighbourhood $A_h(x)$ decreases rapidly with $p$. As a result, the variance
of the estimator increases rapidly with $p$ and one is forced to increase the
bandwidth $h$ to obtain a balance between the variance and the squared bias
of the estimator. On the other hand, $A_h(x)$ is not the only set over which
$f$ can be presumed constant. The density $f$ is also nearly constant over
the larger set $B_h(x) = \{t : \|f(x) - f(t)\| < h\}$. Used as a neighbourhood,
$B_h(x)$ has a large volume even in high dimension, so one is not forced to
use a large bandwidth to obtain a balance between the variance and the
squared bias. In contrast with the neighbourhood $A_h(x)$ which does not
depend on $f$, the neighbourhood $B_h(x)$ does and needs to be estimated.
Of course, the estimation of the neighbourhood $B_h(x)$ is difficult and may
very well offset the improvement resulting from enlarged neighbourhoods.
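
How quickly the volume of the Euclidean neighbourhood shrinks with the dimension can be seen from the closed form $\pi^{p/2}h^p/\Gamma(p/2 + 1)$ for the volume of a ball of radius $h$ in $\mathbb{R}^p$. The short sketch below simply evaluates this formula; the bandwidth value 0.5 and the name `ball_volume` are arbitrary illustrative choices.

```python
import math

def ball_volume(p, h):
    """Lebesgue measure of {t : ||x - t|| < h} in R^p."""
    return math.pi ** (p / 2) * h ** p / math.gamma(p / 2 + 1)

dims = (1, 2, 5, 10, 20)
volumes = [ball_volume(p, 0.5) for p in dims]
for p, v in zip(dims, volumes):
    print(p, v)
```

With a fixed bandwidth the volume, and with it the expected number of observations falling in the neighbourhood, drops by several orders of magnitude between $p = 1$ and $p = 20$.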

Our estimator is based on a neighbourhood which is between the above
two extreme cases corresponding to the sets $A_h(x)$ and $B_h(x)$. Assume
that $f = g \circ D$ for some $g : \mathbb{R} \to [0, \infty)$ and some transformation $D :
\mathbb{R}^p \to [0, \infty)$ that may depend on $f$. Under this restriction, $f$ is constant
whenever $D$ is constant so that $f$ is nearly constant over the set $C_h(x) =
\{t : \|D(x) - D(t)\| < h\}$. The estimator we propose essentially amounts
to using the set $C_h(x)$ as the neighbourhood.

In recent years, the class of ellipsoidal densities $f = g \circ D$ with $D(x) =
(x - \mu)^T \Sigma^{-1} (x - \mu)$ for some function $g$ has received considerable attention
because it enables an analysis of multivariate data which does not rely on 
the validity of the classical multivariate normal theory. The estimator we 
propose goes hand in hand with such developments, providing an estimate 
of the density that outperforms other density estimators within that class. 
The class of ellipsoidal densities is one example among other possible gen- 
eralisations of the classical multivariate normal family. Various possibilities 
will be discussed in what follows. 

Let $m_d$ denote the $d$-dimensional Lebesgue measure. Throughout the
paper, we will assume that the measure $m_p \circ D^{-1}$ is absolutely continuous
with respect to $m_1$, and $L_D$ will denote the Radon-Nikodym derivative of
$m_p \circ D^{-1}$ with respect to $m_1$. Under this assumption,

$$\Pr\{D(X_1) \in A\} = \int 1\{D(t) \in A\}\, g(D(t))\, dt = \int_A g(r) L_D(r)\, dr,$$

showing that $D(X_1)$ has a density, which we will denote $f_D$, and that
$f_D = g L_D$. The relationship $f_D = g L_D$ can also be expressed as

$$f(x) = g(D(x)) = \frac{f_D(D(x))}{L_D(D(x))}.$$


Thus, if $D$ is a known transformation, we can estimate $f(x)$ by

$$\hat{f}(x; D, h) = \frac{\hat{f}_D(D(x))}{L_D(D(x))} = \frac{1}{n} \sum_{i=1}^{n} \frac{K_h(D(x) - D(X_i))}{L_D(D(x))}.$$

Example 1 Consider the case of a spherical density $f$. This assumption
is equivalent to the assumption $f = g \circ D$ for $D(x) = \|x\|^2$ and some
$g : \mathbb{R} \to \mathbb{R}$. Here, $D$ is a fixed transformation and does not need to be
estimated. It is easy to see that

$$L_D(r) = \frac{\pi^{p/2}}{\Gamma(p/2)}\, r^{p/2-1}$$

is the Radon-Nikodym derivative of $m_p \circ D^{-1}$ with respect to $m_1$. Thus, in
the case of a spherical density $f$, we propose the estimator

$$\hat{f}(x; D, h) = \frac{\Gamma(p/2)}{\pi^{p/2}}\, (D(x))^{1-p/2}\, \frac{1}{n} \sum_{i=1}^{n} K_h(D(x) - D(X_i)).$$

The estimator $\hat{f}(x; D, h)$ basically amounts to an estimator of the univariate
density $f_D$, so that we would expect a one-dimensional nonparametric rate
of convergence to $f(x)$ for $\hat{f}(x; D, h)$.
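
The estimator of Example 1 is simple to try out. The following is a minimal sketch under stated assumptions: a Gaussian kernel is used for $K$ (the text leaves the kernel unspecified), the bandwidth 0.3 is an arbitrary choice, and the data are simulated from a standard normal distribution in $p = 5$ dimensions, which is spherical. The function names are hypothetical.

```python
import math
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / math.sqrt(2.0 * math.pi)

def spherical_density_estimate(x, X, h):
    """Estimate f(x) assuming f = g(||x||^2), i.e. D(x) = ||x||^2 is known."""
    p = X.shape[1]
    Dx = float(np.dot(x, x))
    DX = np.sum(X ** 2, axis=1)
    # Univariate kernel estimate of the density f_D of D(X_1) at D(x).
    f_D_hat = np.mean(gaussian_kernel((Dx - DX) / h)) / h
    # Radon-Nikodym derivative L_D evaluated at D(x).
    L_D = math.pi ** (p / 2) * Dx ** (p / 2 - 1) / math.gamma(p / 2)
    return f_D_hat / L_D

rng = np.random.default_rng(2)
p, n = 5, 20000
X = rng.normal(size=(n, p))
x0 = np.array([2.0, 0.0, 0.0, 0.0, 0.0])
estimate = spherical_density_estimate(x0, X, h=0.3)
truth = (2.0 * math.pi) ** (-p / 2) * math.exp(-0.5 * float(np.dot(x0, x0)))
```

Only the one-dimensional sample $D(X_1), \ldots, D(X_n)$ is smoothed, which is why the estimate at `x0` can be close to the true five-dimensional density with a moderate sample size.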


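In general, $D$ could depend on $f$ and would then need to be estimated.
If $\hat{D}_n$ is an estimator of $D$, we can estimate $f(x)$ by

$$\hat{f}(x; \hat{D}_n, h) = \frac{\hat{f}_{\hat{D}_n}(\hat{D}_n(x))}{L_{\hat{D}_n}(\hat{D}_n(x))} = \frac{1}{n} \sum_{i=1}^{n} \frac{K_h(\hat{D}_n(x) - \hat{D}_n(X_i))}{L_{\hat{D}_n}(\hat{D}_n(x))}.$$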

Example 2 Consider the case of an ellipsoidal density $f$. This assumption
is equivalent to the assumption $f = g \circ D$ for $D(x) = (x - \mu)^T \Sigma^{-1} (x - \mu)$
and some $g : \mathbb{R} \to \mathbb{R}$. Here, $D$ depends on unknown parameters $\mu$ and $\Sigma$
that need to be estimated. It is easy to see that

$$L_D(r) = \frac{\pi^{p/2} |\Sigma|^{1/2}}{\Gamma(p/2)}\, r^{p/2-1}$$

is the Radon-Nikodym derivative of $m_p \circ D^{-1}$ with respect to $m_1$. Thus, in
the case of an ellipsoidal density $f$ we propose the estimator

$$\hat{f}(x; \hat{D}_n, h) = \frac{\Gamma(p/2)}{\pi^{p/2} |\hat{\Sigma}|^{1/2}}\, (\hat{D}_n(x))^{1-p/2}\, \frac{1}{n} \sum_{i=1}^{n} K_h(\hat{D}_n(x) - \hat{D}_n(X_i)),$$

where $\hat{D}_n(x) = (x - \hat{\mu})^T \hat{\Sigma}^{-1} (x - \hat{\mu})$ and $\hat{\mu}$ and $\hat{\Sigma}$ are some estimates of $\mu$
and $\Sigma$. The transformation $D$ involves parameters $\mu$ and $\Sigma$ for which there
exist estimates converging at speed $1/\sqrt{n}$. Since the rate of convergence for
$\hat{f}_D$ is slower than $1/\sqrt{n}$, we expect the asymptotic behaviour of $\hat{f}(x; \hat{D}_n, h)$ to
be unaffected by the estimation of $D$ and a one-dimensional nonparametric
rate of convergence to $f(x)$ for $\hat{f}(x; \hat{D}_n, h)$.

Stute and Werner (1991) focus on this last case and propose an estimate
of the density based on the above formula, using independent estimates of
$\mu$ and $\Sigma$ (based on the data of some preliminary sample).

The above two examples are unusual in that an explicit expression for $L_D$
is available. When deriving exact expressions for $L_D$, the following relationships
are useful. Let $D$ be some depth. If $D_1(x) = D(\Sigma^{-1/2}(x - \mu))$,
we have $L_{D_1}(r) = |\Sigma|^{1/2} L_D(r)$. If $\varphi$ is a monotone transformation and if
$D_2(x) = \varphi(D(x))$, we have $L_{D_2}(r) = (L_D/|\varphi'|)(\varphi^{-1}(r))$. Nevertheless, an
explicit expression for $L_D$ is usually not available and an approximation for
the denominator must be used. Note that provided $L_D$ is smooth,

$$\int K_{h_D}(D(x) - D(t))\, dt = \int K(u)\, L_D(D(x) - h_D u)\, du \to L_D(D(x))$$

as $h_D$ converges to 0. Thus, our purpose in this paper is to investigate the
properties of


$$\hat{f}(x; D, h) = \frac{1}{n} \sum_{i=1}^{n} \frac{K_h(D(x) - D(X_i))}{\int K_{h_D}(D(x) - D(t))\, dt}$$

and

$$\hat{f}(x; \hat{D}_n, h) = \frac{1}{n} \sum_{i=1}^{n} \frac{K_h(\hat{D}_n(x) - \hat{D}_n(X_i))}{\int K_{h_D}(\hat{D}_n(x) - \hat{D}_n(t))\, dt},$$

where $K_h(x) = K(x/h)/h$ for some kernel $K : \mathbb{R} \to [0, \infty)$ and some
bandwidths $h$ and $h_D$. The particular case with $h_D = 0$ corresponds to
the situation where there exists an explicit expression for $L_D$ and, since
$h_D$ is merely used to provide a simple approximation for $L_D(D(x))$, our
intention is to let $h_D$ converge to 0 faster than $h$ does. The estimators
$\hat{f}(x; D, h)$ and $\hat{f}(x; \hat{D}_n, h)$ are always non-negative but do not integrate to
1 and, in practice, they need to be normalized.
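
The limit used to justify the integral in the denominator follows from a change of variables. As a sketch, writing $h_D$ for the second bandwidth and assuming that $K$ integrates to one and that $L_D$ is continuous at $D(x)$, integrating first over the level sets of $D$ and then substituting $r = D(x) - h_D u$ gives

```latex
\begin{align*}
\int K_{h_D}\bigl(D(x) - D(t)\bigr)\,dt
  &= \int K_{h_D}\bigl(D(x) - r\bigr)\,L_D(r)\,dr \\
  &= \int K(u)\,L_D\bigl(D(x) - h_D u\bigr)\,du
  \;\longrightarrow\; L_D\bigl(D(x)\bigr)
  \qquad \text{as } h_D \to 0 .
\end{align*}
```

The first equality uses the definition of $L_D$ as the density of $m_p \circ D^{-1}$ with respect to $m_1$, and the second uses $K_{h_D}(z) = K(z/h_D)/h_D$.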

Notions of multivariate depth are interesting candidates for $D$ because
they can usually be estimated at the usual parametric rate $1/\sqrt{n}$. Since
$f_D$ is estimated at a one-dimensional nonparametric rate of convergence,
the estimation of $D$ should not affect the overall rate of convergence. This
implies that for the class of densities such that $f = g \circ D$, we get a
one-dimensional nonparametric rate of convergence in a $p$-dimensional density
estimation problem. This assertion will be proved in the next section.
Many notions of multivariate depth have been defined and studied in the
literature (see Small, 1990), including
• Mahalanobis depth (Mahalanobis, 1936)

\[ D(x) = 1 / \left(1 + (x - \mu)^T \Sigma^{-1} (x - \mu)\right) \]

• Tukey's depth (Tukey, 1975)

\[ D(x) = \inf \left\{ \int_H dP(x) : H \text{ is a closed half-space containing } x \right\} \]

• Simplicial depth (Liu, 1990) $D(x) = \Pr\{x \in S[X_1, \ldots, X_{p+1}]\}$
where $S[X_1, \ldots, X_{p+1}]$ is the simplex with vertices $X_1, \ldots, X_{p+1}$.

• APL depth (Fraiman and Meloche, 1996) $D(x) = K_\gamma * f(x)$ for some
kernel $K$ and some fixed smoothing parameter $\gamma$.


All of the above depths can be estimated at the $1/\sqrt{n}$ parametric rate,
so estimating them costs nothing in terms of the asymptotic
behaviour of $\hat f(x; \hat D_n, h)$. The depth, however, does have an impact on the
asymptotic bias and variance of $\hat f(x; \hat D_n, h)$. More importantly, the depth
$D$ determines the class of densities $f$ of the form $f = g \circ D$. As described
in Example 2, for $D(x) = (x - \mu)^T \Sigma^{-1} (x - \mu)$, $f = g \circ D$ if and only if $f$
is ellipsoidal. In the case of simplicial depth, although we know the level
curves of $D$ must be convex, it is not clear how large the class of densities
$f$ of the form $f = g \circ D$ is. The level curves for APL depth need not even
be convex, but the equation $f = g \circ D$ does not appear to be satisfied
except for ellipsoidal densities.

As noted before, density estimators that perform particularly well under
some class of densities are called "tailor-designed density estimates" by Devroye.
One can regard the estimator we propose as one that takes advantage
of the relationship $f = g \circ D$ for a given notion of depth. The proposed
estimates are not universally consistent (they converge only if $f = g \circ D$) but
they perform better than universally consistent estimates
on the class of densities $f$ of the form $f = g \circ D$.


2 Main Results 


In this section, we present results concerning the strong convergence and
the asymptotic normality of

\[ \hat f(x; D, h) = \frac{1}{n} \sum_{i=1}^{n} \frac{K_h(D(x) - D(X_i))}{L_D(D(x))} \]

and

\[ \hat f(x; \hat D_n, h) = \frac{1}{n} \sum_{i=1}^{n} \frac{K_h(\hat D_n(x) - \hat D_n(X_i))}{\int K_{h_D}(\hat D_n(x) - \hat D_n(t))\,dt} \]

where $\hat D_n$ is an estimator of $D$. We summarize below the assumptions that
are needed. Proofs can be found in the Appendix.


H0: The bandwidth sequence $h = h_n$ is such that $nh^5 \to \delta^2 \in (0,\infty)$.
The bandwidth sequence $h_D = h_{D,n}$ converges to zero faster than $h$ does:
$h_D/h \to 0$.


H1: The kernel K is symmetric, has a bounded support, integrates to 1 
and has three bounded and continuous derivatives. 


H2: $X_1, \ldots, X_n$ are i.i.d. with some density $f : \mathbb{R}^p \to (0,\infty)$ such that
$f = g \circ D$ for some function $g : \mathbb{R} \to (0,\infty)$. Both $g$ and $f$ are bounded
and have two bounded and continuous derivatives.


H3: There exists a 1-1 and continuously differentiable transformation
$T : \mathbb{R} \times (0,1)^{p-1} \to \mathbb{R}^p$ such that $D(T(r,\theta)) = r$ for all $r \in \mathbb{R}$ and all
$\theta \in [0,1]^{p-1}$. The transformation $T$ and its Jacobian $J_T$ have two bounded
and continuous partial derivatives with respect to $r$.


H4: The inverse image by $D$ of a bounded set is bounded and $x$ is in the
interior of the support of $D$.


Assumptions H0 and H1 are more or less the standard assumptions for
the bandwidth and the kernel in kernel density estimation. The bounded
support of the kernel is not usually needed but simplifies the proofs. The
necessity of Assumption H2 can be explained as follows. Since $\hat f(x; D, h)$
is a function of $D(x)$, we can only hope to get consistency if $f = g \circ D$
for some $g : \mathbb{R} \to [0,\infty)$. Note that, by virtue of H3,


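\[
\begin{aligned}
\Pr\{D(X_1) \in A\} &= \int 1\{D(t) \in A\}\, g(D(t))\,dt \\
&= \iint 1\{D(T(r,\theta)) \in A\}\, g(D(T(r,\theta)))\, |J_T(r,\theta)|\,d\theta\,dr \\
&= \iint 1\{r \in A\}\, g(r)\, |J_T(r,\theta)|\,d\theta\,dr \\
&= \int_A g(r)\, L_D(r)\,dr
\end{aligned}
\]

where $L_D(r) = \int |J_T(r,\theta)|\,d\theta$. Thus, H3 implies that the random variable
$D(X_1)$ has the density $f_D = g L_D$.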
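As an illustration of the identity $f_D = g L_D$ (a worked sketch that is not part of the original text, with the angular variables rescaled so that they range over the unit cube), take $D(x) = \|x\|$ on $\mathbb{R}^p$ and let $T$ be the usual polar transformation. Integrating the polar Jacobian over the angles gives the surface area of the sphere of radius $r$:

\[
L_D(r) = \frac{2\pi^{p/2}}{\Gamma(p/2)}\, r^{p-1}, \qquad r > 0,
\]

so that $L_D(r) = 2$ for $p = 1$ and $L_D(r) = 2\pi r$ for $p = 2$. Similarly, for $D(x) = (x-\mu)^T \Sigma^{-1} (x-\mu)$ with $p = 2$, the substitution $y = \Sigma^{-1/2}(x-\mu)$ followed by $r = \|y\|^2$ gives the constant

\[
L_D(r) = \pi\, |\Sigma|^{1/2}.
\]

As a check, a bivariate normal density has $g(r) = e^{-r/2} / (2\pi |\Sigma|^{1/2})$, hence $f_D(r) = g(r) L_D(r) = \tfrac12 e^{-r/2}$, which is the $\chi^2_2$ density, as expected for a squared Mahalanobis distance.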


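Theorem 1 If H0-H3 hold,

\[
\sqrt{nh}\,\{\hat f(x; D, h) - f(x)\} \xrightarrow{L}
N\left( \frac{\delta\, f_D''(D(x))}{2\, L_D(D(x))} \int u^2 K(u)\,du,\;
\frac{f(x)}{L_D(D(x))} \int K^2(u)\,du \right).
\]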
The asymptotic distribution of $\hat f(x; D, h)$ should be compared to that of
the multivariate kernel density estimator $\hat f(x; h)$, which is well known to be

\[
\sqrt{nh^p}\,\{\hat f(x; h) - f(x)\} \xrightarrow{L}
N\left( \frac{\delta}{2} \nabla^2 f(x) \int u^2 K(u)\,du,\; \int K^2(u)\,du\, f(x) \right).
\]


Table 1 provides the asymptotic bias and variance for both $\hat f(x; h)$ and
$\hat f(x; D, h)$. The most striking difference is the rate of convergence to 0
of the asymptotic variance. The asymptotic bias converges to 0 at the same
rate for both estimators, but the constants depend on $f$ in different ways. In the
table, $\alpha = \int u^2 K(u)\,du$ and $\beta = \int K^2(u)\,du$.

Table 1: The asymptotic bias and variance for $\hat f(x; h)$ and $\hat f(x; D, h)$.


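Estimator            Asymptotic Bias                                  Asymptotic Variance

$\hat f(x; h)$       $\frac12 \alpha h^2 \nabla^2 f(x)$               $\frac{\beta}{nh^p} f(x)$

$\hat f(x; D, h)$    $\frac12 \alpha h^2 \frac{f_D''(D(x))}{L_D(D(x))}$   $\frac{\beta}{nh} \frac{f(x)}{L_D(D(x))}$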
D(z) =|e-p| (p=1) Joh? PDE) EY 


( ah BA DO er 


Our proof of the asymptotic normality for $\hat f(x; \hat D_n, h)$ uses a three-term
Taylor series approximation for $\hat f(x; \hat D_n, h)$. We prove that provided
$\sqrt{nh}\, \|\hat D_n - D\|_\infty^3 = o_p(h_D^4)$,

\[ \sqrt{nh}\,\{\hat f(x; \hat D_n, h) - \hat f(x; D, h)\} \xrightarrow{P} 0. \]

Note that for the kernel density estimator, the optimal bandwidth is of the
order $n^{-1/5}$ so that if $\|\hat D_n - D\|_\infty = O_p(1/\sqrt{n})$, the condition
$\sqrt{nh}\, \|\hat D_n - D\|_\infty^3 = o_p(h_D^4)$ amounts to $h/(n^2 h_D^8) \to 0$ and can be satisfied for the
optimal bandwidth $h$ and a slightly faster $h_D$. A higher order Taylor series
approximation for $\hat f(x; \hat D_n, h)$ would result in weaker restrictions on the
bandwidths $h$ and $h_D$ but at the cost of additional smoothness requirements
on $K$, $T$ and $J_T$.
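To make the bandwidth restriction concrete, here is a short worked calculation (an illustration added under the assumption of polynomial bandwidths, $h = n^{-1/5}$ and $h_D = n^{-a}$; the exponent $a$ is a hypothetical tuning constant, not a quantity from the original text):

\[
\frac{h}{n^2 h_D^8} = n^{-1/5 - 2 + 8a} \to 0 \iff a < \frac{11}{40},
\qquad
\frac{h_D}{h} = n^{1/5 - a} \to 0 \iff a > \frac{1}{5}.
\]

Under these assumptions any exponent with $1/5 < a < 11/40$ satisfies both requirements, which is the sense in which $h_D$ needs to be only slightly faster than the optimal $h$.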


Theorem 2 Assume H0-H4 hold and define

\[ H_{nk}(s,t) = n\, E\{\Delta_n^k(x, X_1)\, \Delta_n^k(x, X_2) \mid X_1 = s, X_2 = t\} \]

and

\[ G_{nk}(s,t) = n\, E\, \Delta_n^k(x,s)\, \Delta_n^k(x,t) \]

where $\Delta_n(x,y) = (\hat D_n(x) - \hat D_n(y)) - (D(x) - D(y))$. If $H_{nk}$ and $G_{nk}$
have two continuous and uniformly bounded derivatives for $k = 1, 2$ and
if $\sqrt{nh}\, \|\hat D_n - D\|_\infty^3 = o_p(h_D^4)$, then $\hat f(x; \hat D_n, h)$ and $\hat f(x; D, h)$ have the same
asymptotic distribution.
Besides the smoothness assumption (H2) made on $f$ and $g$, the
application of Theorem 2 requires the verification of H3, the existence of a
smooth change of variable transformation $T$ such that $D(T(r,\theta)) = r$. The
existence of such a transformation is guaranteed whenever $D$ has a unique
global maximum $M$ (such a global maximum can be regarded as the deepest
point or a median) and is decreasing along every ray originating from
$M$. In such circumstances, we can define $T$ as the inverse of the "polar
transformation" $(D, \Theta)$, where $D$ replaces the usual norm and where the
angles $\Theta$ are determined about the maximum $M$. The smoothness of the
resulting $T$ is equivalent to the smoothness of $D$. Thus, we can reformulate
all the smoothness assumptions in terms of the smoothness of $g$ and $D$.


3 Simulation 


In this section, we present a small simulation that compares $\hat f(x; \hat D_n, h)$ to
the kernel density estimator $\hat f(x; h)$ in cases where the underlying density
$f$ is of the form $f = g \circ D$. The simulation involves the Mahalanobis depth.
It involves 100 samples of 50 i.i.d. observations with distribution

\[ N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1.0 & 0.9 \\ 0.9 & 1.0 \end{pmatrix} \right). \]

For the Mahalanobis depth, $D(x) = (x - \mu)^T \Sigma^{-1} (x - \mu)$ and
$\hat D_n(x) = (x - \hat\mu)^T \hat\Sigma^{-1} (x - \hat\mu)$ with the usual sample moment estimators $\hat\mu$ and $\hat\Sigma$. In
this case we know $L_D(r)$ up to a normalizing factor and we use $h_D = 0$. We
evaluate both $\hat f(x; \hat D_n, h)$ and $\hat f(x; h)$ with their respective ASE-optimal
bandwidths, determined by minimizing over $h$


\[ \mathrm{ASE}(\hat f(x; \hat D_n, h), h) = \frac{1}{50} \sum_{i=1}^{50} \left( \hat f(X_i; \hat D_n, h) - f(X_i) \right)^2 \]

and

\[ \mathrm{ASE}(\hat f(x; h), h) = \frac{1}{50} \sum_{i=1}^{50} \left( \hat f(X_i; h) - f(X_i) \right)^2 \]

respectively.

Note that the numerator of $\hat f(x; \hat D_n, h)$ is in fact a kernel density
estimate for the data $\hat D_n(X_1), \ldots, \hat D_n(X_n)$ evaluated at $\hat D_n(x)$. In practice,
a boundary correction kernel must be used because $\hat D_n$ has a bounded
range. It is also possible to use a transformation to avoid having to make
a boundary correction. The simulation uses a boundary correction kernel.
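The construction used in the simulation can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' code: it takes $D(x) = (x-\mu)^T\Sigma^{-1}(x-\mu)$ with $p = 2$, for which $L_D(r) = \pi|\Sigma|^{1/2}$ is constant (so $h_D = 0$), it uses an Epanechnikov kernel, and it replaces the boundary correction kernel by a simple reflection at 0. All function names are hypothetical.

```python
import math
import random

def epanechnikov(u):
    """Epanechnikov kernel: symmetric, bounded support, integrates to 1."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def sample_moments(data):
    """Sample mean and covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(pt, mean, cov):
    """D(x) = (x - mu)^T Sigma^{-1} (x - mu) for a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = pt[0] - mean[0], pt[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def depth_density(pt, data, h):
    """Kernel estimate of the density of the depths, divided by L_D."""
    mean, cov = sample_moments(data)
    det = cov[0] * cov[2] - cov[1] ** 2
    r = mahalanobis_sq(pt, mean, cov)
    depths = [mahalanobis_sq(p, mean, cov) for p in data]
    # Assumption: reflection at 0 stands in for a boundary correction kernel.
    num = sum(epanechnikov((r - d) / h) + epanechnikov((r + d) / h)
              for d in depths) / (len(depths) * h)
    # For p = 2 and this D, L_D(r) = pi * |Sigma|^{1/2}, constant in r.
    return num / (math.pi * math.sqrt(det))

def draw_sample(n, rho, rng):
    """n draws from a bivariate normal, unit variances, correlation rho."""
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return out

def true_density(pt, rho):
    """Density of the bivariate normal sampled by draw_sample."""
    det = 1.0 - rho * rho
    q = (pt[0] ** 2 - 2.0 * rho * pt[0] * pt[1] + pt[1] ** 2) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))
```

A call such as `depth_density((1.0, 0.7), draw_sample(50, 0.9, random.Random(0)), 0.5)` mimics one evaluation in the simulation setting; the bandwidth 0.5 is an arbitrary choice, not the ASE-optimal one.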

Table 2 below summarizes the results. With the Mahalanobis depth,
$\hat f(x; \hat D_n, h)$ clearly outperforms $\hat f(x; h)$. The average ASE for $\hat f(x; \hat D_n, h)$ is
about three times smaller than the average ASE for $\hat f(x; h)$. According to
the theoretical results of the previous section, for such elliptical densities,
the difference between the performance of $\hat f(x; \hat D_n, h)$ and $\hat f(x; h)$ should
grow with the dimension.


Table 2: ASE of $\hat f(x; h)$ and $\hat f(x; \hat D_n, h)$ (Mahalanobis depth).

            $\hat f(x; h)$    $\hat f(x; \hat D_n, h)$
Average     0.96              0.32
S.D.        0.28              0.19


Figure 1 illustrates a very typical outcome of the above simulation.
Figure 1 a) displays the 50 observations, Figure 1 b) displays the true density,
Figure 1 c) displays the kernel density estimate $\hat f(x; h)$ and Figure 1 d)
displays $\hat f(x; \hat D_n, h)$ for the Mahalanobis depth. The estimate $\hat f(x; \hat D_n, h)$
is by construction perfectly ellipsoidal and the spurious bumps and dips
found in $\hat f(x; h)$ have disappeared. Also, the bandwidth minimizing the
ASE for the kernel density estimate is much smaller than the bandwidth
minimizing the ASE for the Mahalanobis depth density estimate.


4 Conclusion 


In cases where the unknown density $f$ satisfies $f = g \circ D$, our theoretical
results show that, in high dimension, $\hat f(x; \hat D_n, h)$ has better asymptotic
performance than the usual kernel density estimator $\hat f(x; h)$. Our simulations
suggest this is true in two dimensions, even for small samples.

The theoretical results also make smoothness assumptions that exclude
cases where the unknown density $f$ is multimodal. Even though the
smoothness assumptions can be relaxed to include such cases, an important
practical problem remains. For multimodal densities, $f_D$ (estimated by the
numerator of $\hat f(x; \hat D_n, h)$) and $L_D$ (estimated by the denominator of $\hat f(x; \hat D_n, h)$)
are discontinuous functions. Depth-based estimation for such densities
would require the development of a reasonably good density estimator for
discontinuous densities.


Appendix: Proofs of Theorems 1 and 2 
This section is devoted to the proof of Theorems 1 and 2. 


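[Figure 1: Outcome of the simulation. Panels: a) Data; b) True density; c) kernel density estimate; d) depth-based estimate for the Mahalanobis depth.]

Proof of Theorem 1: Define

\[ \hat f_D(r; h) = \frac{1}{n} \sum_{i=1}^{n} K_h(r - D(X_i)), \]

\[ \hat L_D(r) = \int K_{h_D}(r - D(t))\,dt = \int K(u)\, L_D(r - h_D u)\,du \]

and

\[ \bar f_D(r) = E\, K_h(r - D(X_1)) = \int K(u)\, f_D(r - hu)\,du. \]

With this notation,

\[ \hat f(x; D, h) = \frac{\hat f_D(D(x); h)}{L_D(D(x))} \]

and $\hat f_D(r; h)$ is clearly a kernel density estimator of $f_D(r)$. The asymptotic
normality of the kernel density estimator is well known so that

\[ \sqrt{nh}\,\{\hat f_D(r; h) - \bar f_D(r)\} \xrightarrow{L} N\left(0,\; f_D(r) \int K^2(u)\,du\right). \]

Therefore (since $g = f_D / L_D$),

\[ \sqrt{nh}\; \frac{\hat f_D(r; h) - \bar f_D(r)}{L_D(r)} \xrightarrow{L} N\left(0,\; \frac{g(r)}{L_D(r)} \int K^2(u)\,du\right). \]

The stated asymptotic normality for $\hat f(x; D, h)$ follows because, with $r = D(x)$,

\[ \sqrt{nh}\,\{\hat f(x; D, h) - f(x)\} = \sqrt{nh}\; \frac{\hat f_D(r; h) - \bar f_D(r)}{L_D(r)} + \sqrt{nh}\; \frac{\bar f_D(r) - f_D(r)}{L_D(r)} \]

and

\[ \sqrt{nh}\; \frac{\bar f_D(r) - f_D(r)}{L_D(r)} \to \frac{\delta\, f_D''(r)}{2\, L_D(r)} \int u^2 K(u)\,du. \]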


Lemma A Let yn be some sequence that converges to infinity and assume 
that 


note that 
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and apply Slutsky Theorem. O 


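Lemma B If $\int (K^{(k)}(u))^2\,du < \infty$ and if the density $f_D$ of $D(X_1)$ is
bounded,

\[ \sup_h\; h^{2k}\, E\, h \left( K_h^{(k)}(r - D(X_1)) \right)^2 < \infty. \]

Proof: Note that

\[
\begin{aligned}
h^{2k}\, E\, h \left( K_h^{(k)}(r - D(X_1)) \right)^2
&= h^{2k+1} \int \left( K_h^{(k)}(r - t) \right)^2 f_D(t)\,dt \\
&= h^{2k+1} \int h^{-2(k+1)} \left( K^{(k)}\left(\tfrac{r-t}{h}\right) \right)^2 f_D(t)\,dt \\
&= \int \tfrac{1}{h} \left( K^{(k)}\left(\tfrac{r-t}{h}\right) \right)^2 f_D(t)\,dt \\
&= \int \left( K^{(k)}(u) \right)^2 f_D(r - hu)\,du \\
&\le \|f_D\|_\infty \int \left( K^{(k)}(u) \right)^2 du. \qquad \Box
\end{aligned}
\]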


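Lemma 1 Assume that H3 holds. If $G_n : \mathbb{R}^{2p} \to \mathbb{R}$ has two continuous
and uniformly bounded derivatives, then for $k \in \{1, 2\}$,

\[ \sup \left| \iint G_n(x,y)\, K_h^{(k)}(u - D(x))\, K_h^{(k)}(v - D(y))\,dx\,dy \right| < \infty. \]

Proof: By virtue of H3,

\[
\begin{aligned}
&\iint G_n(x,y)\, K_h^{(k)}(u - D(x))\, K_h^{(k)}(v - D(y))\,dx\,dy \\
&\quad = \iiiint G_n(T(s,\theta_1), T(t,\theta_2))\, K_h^{(k)}(u - D(T(s,\theta_1)))\, K_h^{(k)}(v - D(T(t,\theta_2))) \\
&\qquad\qquad |J_T(s,\theta_1)|\, |J_T(t,\theta_2)|\,d\theta_1\,d\theta_2\,ds\,dt \\
&\quad = \iint \tilde G_n(s,t)\, K_h^{(k)}(u - s)\, K_h^{(k)}(v - t)\,ds\,dt
\end{aligned}
\]

with

\[ \tilde G_n(s,t) = \iint G_n(T(s,\theta_1), T(t,\theta_2))\, |J_T(s,\theta_1)|\, |J_T(t,\theta_2)|\,d\theta_1\,d\theta_2. \]

Since $T$, $J_T$ and $G_n$ have two continuous and uniformly bounded derivatives,
so does $\tilde G_n$ and the result easily follows from

\[
\begin{aligned}
&\iint G_n(x,y)\, K_h^{(k)}(u - D(x))\, K_h^{(k)}(v - D(y))\,dx\,dy \\
&\quad = \iint \tilde G_n(s,t)\, K_h^{(k)}(u - s)\, K_h^{(k)}(v - t)\,ds\,dt \\
&\quad = \iint \frac{\partial^{2k}}{\partial s^k\, \partial t^k} \tilde G_n(s,t)\, K_h(u - s)\, K_h(v - t)\,ds\,dt \\
&\quad = \iint \frac{\partial^{2k}}{\partial s^k\, \partial t^k} \tilde G_n(u - s, v - t)\, K_h(s)\, K_h(t)\,ds\,dt.
\end{aligned}
\]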
Lemma 2 Assume that H3 holds. If $H_n : \mathbb{R}^{2p} \to \mathbb{R}$ has two continuous and
uniformly bounded derivatives, and if $f$ has two continuous and bounded
derivatives, then for $k \in \{1, 2\}$,

\[ \sup \left| E\, H_n(X_1, X_2)\, K_h^{(k)}(u - D(X_1))\, K_h^{(k)}(v - D(X_2)) \right| < \infty. \]

Proof: Note that

\[
E\, H_n(X_1, X_2)\, K_h^{(k)}(u - D(X_1))\, K_h^{(k)}(v - D(X_2))
= \iint H_n(x,y)\, K_h^{(k)}(u - D(x))\, K_h^{(k)}(v - D(y))\, f(x)\, f(y)\,dx\,dy
\]

and apply Lemma 1 to $G_n(x,y) = H_n(x,y)\, f(x)\, f(y)$. $\Box$

Proof of Theorem 2: First note that $|\Delta_n(x,y)| \le 2\, \|\hat D_n - D\|_\infty$ and
that $\sqrt{nh}\, \|\hat D_n - D\|_\infty^3 = o_p(h_D^4)$ implies $\sqrt{nh}\, \|\hat D_n - D\|_\infty^3 = o_p(h^4)$ and
$\|\hat D_n - D\|_\infty = o_p(h)$. Using Lemma A, we can prove that $\hat f(x; \hat D_n, h)$ and
$\hat f(x; D, h)$ have the same asymptotic distribution by showing

\[ \sqrt{nh}\,\{\hat f_D(\hat D_n(x); h) - \hat f_D(D(x); h)\} \xrightarrow{P} 0 \qquad (1) \]

and

\[ \sqrt{nh}\,\{\hat L_D(\hat D_n(x)) - \hat L_D(D(x))\} \xrightarrow{P} 0. \qquad (2) \]

We use Taylor series approximations for the numerator (1) and the
denominator (2) separately. For the numerator,


f (Dn(z);h) -E iS Ak (x, X;) Kt") (D(x) — D(X:)) 
+25 AR, Xi) KPO) 


for some 6 on the segment joining D(x) — D(X;) to Dn(r) — Dn(X;). For 
the remainder, we have 


|KO loo 


3 A A 
An (2, Xi) Ky” (6) DD 


and Vnh ||D, — DIIS, = 0,(h*) ensures that the remainder is negligible. 
Thus, (1) follows from 


hE (> E i 
n SO Arle, Xi) Ky (D(z) — D(X:))) — 0 
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for k = 1,2. The above can be written as a double sum over all indices i 
and 7. For i = 7, we have 


nh 3, E AZ (z, X1) (Kp? (D(a) - D(X) 


< h (2lbez Dle) En? (K{ (D(x) - D) 


which converges to zero (invoke Lemma B) provided ln — Dllo = Op (h) 
(a consequence of Vnh || Dn — DIIS, = 0,(h*)). When i 4 j, we have 


nh “OY E Ak(x, X1) K (D(2) - D(X1)) Ak(@, X2) KP (D(x) — D(X2)) 
= h E Hnk(Xı, X2) Ky’ (D(a) - D(X1)) Kt (D(z) - D(X2)) 
which converges to zero as well because according to Lemma 2, 


sup |E Hnx(X1,X2)Ky. (D(z) — D(X1)) Kj. (D(@) — D(X2))| < 00. 


For the denominator, we consider the expansion 

2 
— A k A 
£,(Da(z)) = D J AK (x,t) KẸ (D(«)—D(t)) dt+ J A$ (x,t) KÈ (bx) dt 


for some ĝ; on the segment joining D(x) — D(t) to D,(x) — Da (t). Note 
that since K has a bounded support, H4 implies that for n large enough, 
K (0+) almost surely vanishes outside of a bounded set so that (2) follows 
from 


Vnh A? (z,t) Ky, (6) Fo 


for almost all t and 


nh E ( J Ak (x,t) KÉ (D(a) — D(t)) it) 0 


for k = 1,2. The proof of the above is similar to the one we made for the 
numerator but uses Lemma 1 instead of Lemma 2. O 
Abstract: This paper surveys some recent results on mode and
concentration estimation in multidimensions, including excess-mass sets and
multidimensional quantiles. Extensions of these estimators to the hyper- 
sphere are developed here. In particular, the modal direction is measured 
by the center of a minimal cap, and concentration is measured by a func- 
tion of the opening of that cap. For samples from a distribution for which 
the minimal cap is unique, it is shown that the center and the opening 
of the empirical cap are strongly consistent estimators for their respec- 
tive parameters. Rates of convergence and limiting distributions of the 
estimators are established by means of empirical process theory. 


Key words: Multidimensional mode estimation, directional data, cube- 
root rates, empirical processes. 
1 Background


A variety of location estimators for multidimensional data have been
recently proposed and investigated. Examples of extensions of the median to
higher dimensions include the L1 median (Brown, 1983, and Ducharme and
Milasevic, 1987), Oja's simplex (Oja, 1983), the halfspace median (Donoho
and Gasko, 1992), and the simplicial depth median (Liu, 1990). For an
overview of recent results in this area see Small (1990). Small (1990) also
discusses extensions of these notions of the median to directional data, and
Liu and Singh (1992) investigate them in greater detail. Asymptotic
properties of the simplicial depth were studied by Arcones et al. (1994) and
Dümbgen (1992), those of the halfspace median have been studied by Nolan
(1992), and Chaudhuri (1996) has presented a general approach to studying
quantiles in multidimensions.

1 Milasevic died while this manuscript was in preparation. Nolan would like to dedicate
it to his memory.

Parallel to the development of the multidimensional median there have 
been investigations into the properties of estimators of the mode and con- 
centration. Chernoff (1964) and Venter (1967) estimate the mode of a den- 
sity function in one dimension by the center of the interval of fixed length 
to contain the greatest number of observations and by the center of the 
shortest interval to contain at least half of the observations, respectively. 
Sager (1979) generalized these univariate set statistics to the multidimen- 
sional case. He estimates the contours of a unimodal density by a sequence 
of nested convex sets. The first and largest set is the smallest convex set 
to contain a fixed proportion q of the observations; the second set is the 
smallest convex set that contains proportion q of the observations within 
the first set, and so forth. Eddy and Hartigan (1977) proposed a similar
multidimensional estimator. The asymptotic properties of the shorth,
the shortest interval to contain at least half of the observations, were
investigated by Grübel (1988) and Kim and Pollard (1990). Grübel (1988)
handled the length of the shorth, and Kim and Pollard developed theory
for cube-root rates of convergence to address the center of the shorth.
Einmahl and Mason (1992) produced asymptotic theory for generalizations of
the length of the shorth.

Close relatives to these estimators of the mode are contour estimators 
based on excess-mass, proposed independently by Hartigan (1987) and 
Müller and Sawitzki (1991). An excess mass set for a distribution is the 
set that maximizes the difference between the probability content of the set 
and a multiple of its Lebesgue measure. 

Nolan (1991) considered the properties of these sets when restricted to 
ellipsoids, and found the parameters of the ellipsoid have cube-root rates 
of convergence. Polonik (1995a,b) provides a comprehensive investigation 
into the properties of these excess mass sets, their connections to maximum 
likelihood estimation under shape restrictions, and their use in tests of 
multimodality. He (Polonik, 1995b) shows that the excess mass sets can be 
used to form a density estimator. The estimator coincides with Grenander’s 
estimator in one dimension when the sets are restricted to intervals with 
left endpoint 0, and with Sager’s estimator in higher dimensions. 
2 The hypersphere


Here, we consider the extension to directional data of the shorth's intuitive
geometric approach to measuring location and concentration. Location and
concentration of the distribution are derived from the smallest cap on the
sphere that has probability content at least $\alpha$, for $0 < \alpha < 1$. When unique,
the center of this minimal cap can be interpreted as a modal direction of
the distribution and the cosine of the half-angle of this cap is a measure of
concentration. Given a sample of size $n$ on the sphere, the modal direction
and concentration can be estimated from the empirical minimal cap, the
smallest cap that contains at least $n\alpha$ observations.

The method proposed here is in some sense a geometric counterpart 
to Watson’s estimator (Watson, 1983, Chapter 5) because it too can pro- 
vide information on location for axial distributions. For example, with the 
Scheidegger-Watson, Arnold and uniform distributions, there is no unique 
minimal cap, but the collection of centers of the caps provides meaningful 
modal directions. For the Scheidegger-Watson distribution the collection 
of centers is the axis of symmetry; for the Arnold distribution it is the 
plane of the girdle and for the uniform distribution it is the entire sphere. 
The minimal cap differs from Watson's estimator and other current
estimators of location and concentration on the hypersphere (Ducharme and
Milasevic, 1987, Fisher et al., 1987, Watson, 1983) in its geometric rather
than metric nature. As with other mode estimates, the convergence of the
center of the sample minimal cap is at a cube-root rate. The concentration
of the cap, however, has a square-root rate of convergence, as it is similar in
behavior to a quantile estimator.

To formally define the minimal cap, we introduce some notation. Let $S$
denote the Euclidean unit sphere $S^{p-1}$ in $\mathbb{R}^p$ and $C(u,t)$ the (hyper)spherical
cap with center $u$ and half-angle $\arccos(t)$. That is, $C(u,t) = \{x \in S :
u'x \ge t\}$, for $t \in [-1,1]$. To simplify notation, given a distribution $F$ on
$S$ we shall write $F(u,t)$ for $F(C(u,t))$.


Definition 1 Let $0 < \alpha < 1$ and $F$ be a probability measure on $S$. A cap
$C(u_0,t_0)$ is called a minimal $\alpha$-cap of $F$ if $F(u_0,t_0) \ge \alpha$ and if for each
$t > t_0$, $\sup_{u \in S} F(u,t) < \alpha$.


Definition 2 Given a set $X^{(n)}$ of $n$ points on $S$, $C_n = C(u_n,t_n)$ is called
a minimal $\alpha$-cap of $X^{(n)}$ if it is a minimal $\alpha$-cap of the empirical measure
$F_n$ based on $X^{(n)}$.


Note that $t_0 = \sup\{t : \sup_{u \in S} F(u,t) \ge \alpha\}$, for $0 < \alpha < 1$. We shall call
this value the $\alpha$-concentration coefficient of $F$. If $C(u_0,t_0)$ is the unique
minimal $\alpha$-cap of $F$ then $t_0$ is the $(1-\alpha)$-quantile of the variable $u_0'X$,
where $X$ has distribution $F$.

The minimal $\alpha$-cap of a distribution $F$ is unique with center $u_0 \in S$ if
and only if $F$ satisfies the following property:


(A) For every $u \ne u_0$ and every $t$ such that $F(u,t) \ge \alpha$, there exists $t' > t$
such that $F(u_0,t') \ge \alpha$.


We shall denote by $M_\alpha(u_0)$ the class of absolutely continuous
distributions on $S$ verifying property (A) and by $M(u_0)$ the class $\bigcap_\alpha M_\alpha(u_0)$. An
absolutely continuous distribution $F$ belongs to the latter class if and only
if it satisfies the following "unimodality" property:


(B) For every $u \ne u_0$ and every $t$, there exists $t' > t$ such that $F(u_0,t') \ge
F(u,t)$.


An absolutely continuous distribution with unimodal density does not
generally satisfy property (B). It does, however, if $F$ is also
rotationally symmetric about the mode of its density. The Langevin
distribution is such an example. Its density is proportional to $\exp(\kappa u_0'x)$,
and so is both unimodal at $u_0$ and rotationally symmetric about $u_0$. Not
surprisingly, the $\alpha$-concentration coefficient is a strictly increasing function
of the concentration parameter $\kappa$ appearing in the density. For $p = 3$ we
have $t_0 = \kappa^{-1} \log[e^{\kappa} - 2\alpha \sinh(\kappa)]$. On the other hand, if $f$ is rotationally
symmetric and bimodal, then there are two minimal $\alpha$-caps $C(u_0, t_0)$ and
$C(-u_0, t_0)$, which provide a unique axis of rotation. Finally, the normalized
mean (see Watson, 1983) and the normalized median (see Ducharme and
Milasevic, 1987) both coincide with $u_0$ if $F$ belongs to $M(u_0)$.
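The closed form for the Langevin distribution with p = 3 can be checked numerically. The sketch below is an added illustration, not part of the original paper; it relies on the standard fact that on the ordinary sphere the variable u0'X then has density kappa * exp(kappa * s) / (2 sinh(kappa)) on [-1, 1], and the function names are hypothetical.

```python
import math

def langevin_concentration(kappa, alpha):
    """alpha-concentration coefficient t0 of the Langevin law for p = 3."""
    return math.log(math.exp(kappa) - 2.0 * alpha * math.sinh(kappa)) / kappa

def cap_probability(kappa, t, steps=20000):
    """P(u0'X >= t), by midpoint integration of the density of u0'X."""
    width = (1.0 - t) / steps
    total = 0.0
    for i in range(steps):
        s = t + (i + 0.5) * width
        total += kappa * math.exp(kappa * s)
    return total * width / (2.0 * math.sinh(kappa))
```

For instance, `cap_probability(2.0, langevin_concentration(2.0, 0.5))` should return a value very close to 0.5, and evaluating `langevin_concentration` on an increasing grid of `kappa` values illustrates the monotonicity stated above.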

The following algorithm will generally allow us to find a minimal
$\alpha$-cap of a set $X^{(n)}$ of points on $S$. It is stated for the case $p = 3$ but can
be generalized to any dimension. It points out the similarity between the
minimal cap and the minimum-volume sphere.


S1. For each triple of points of $X^{(n)}$ consider the circle on $S^2$
determined by them. Sort the circles according to the number of elements of
$X^{(n)}$ in them, and for each $[\alpha n]$, choose among the caps with $[\alpha n]$ elements
those with minimal opening.


S2. For each pair of points of $X^{(n)}$ consider the smallest circle on $S^2$
containing the two points and then proceed as in S1.


S3. Let $C_n^*$ be the smallest of the two caps obtained in S1 and S2.


Proposition 1 If $X^{(n)}$ is in general position, i.e. no more than $p$ points
of $X^{(n)}$ lie on an affine hyperplane, then $C_n^*$ is a minimal $\alpha$-cap of $X^{(n)}$.
The proposition states that every minimal cap can be found via the
above algorithm. This follows from the fact that the boundary of a minimal
cap must contain two or three points of $X^{(n)}$. Otherwise, there would exist
a smaller cap with the same number of elements, which contradicts the
minimality property.

To reduce computation time, the algorithm proposed by Rousseeuw and
Leroy (1987) can be adapted to this problem. Rather than searching over
all caps determined by the $\binom{n}{3} + \binom{n}{2}$ subsets of observations, a set of $m$
triples of observations can be selected at random. For each triple, the
corresponding cap is computed; then the cap is shrunk or expanded until
it contains $[\alpha n]$ points. The minimal cap is chosen from among these $m$
$\alpha$-caps, and the order of operations is now reduced to $O(mn)$.
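The random-subset search just described can be sketched as follows for p = 3. This is an illustrative sketch rather than the authors' implementation: the cap center for a triple is taken as a unit normal of the plane through the three points (both signs are tried), the "shrink or expand" step sets t to the m-th largest projection with m = ceil(alpha * n) (an assumed rounding convention), and a full sort is used for simplicity, which costs O(n log n) per triple instead of O(n). All function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    """Normalize v to unit length; None if v is the zero vector."""
    norm = math.sqrt(dot(v, v))
    return (v[0] / norm, v[1] / norm, v[2] / norm) if norm > 0.0 else None

def cap_threshold(center, points, m):
    """Largest t such that the cap C(center, t) holds at least m points."""
    proj = sorted((dot(center, x) for x in points), reverse=True)
    return proj[m - 1]

def approx_minimal_cap(points, alpha, trials, rng):
    """Best alpha-cap among caps centered at normals of random triples."""
    m = math.ceil(alpha * len(points))
    best_u, best_t = None, -2.0
    for _ in range(trials):
        a, b, c = rng.sample(points, 3)
        normal = unit(cross(sub(b, a), sub(c, a)))
        if normal is None:  # degenerate (collinear) triple, skip it
            continue
        for u in (normal, (-normal[0], -normal[1], -normal[2])):
            t = cap_threshold(u, points, m)
            if t > best_t:  # larger t means a smaller cap
                best_u, best_t = u, t
    return best_u, best_t
```

The returned pair plays the role of (u_n, t_n): the center estimates the modal direction and the threshold estimates the concentration coefficient, up to the approximation error of the random search.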


3 Asymptotic properties 


Let $X^{(n)}$ be an i.i.d. sample of size $n$ from $F \in M_\alpha(u_0)$, and let $C(u_n, t_n)$
be a minimal $\alpha$-cap of $X^{(n)}$. Note that $X^{(n)}$ is almost surely in general
position. We then have the following


Proposition 2 Assume that $F(u_0, t_0 - \delta) > \alpha$ for each (admissible) $\delta > 0$.
Then i) $t_n \to t_0$ almost surely as $n \to \infty$ and ii) $u_n \to u_0$ almost surely
as $n \to \infty$.


In the case where $F$ is rotationally symmetric about $u_0$, its density is of
the form $f(u_0'x)$ for some suitable function $f$ on $[-1,1]$. The assumption of
the proposition is then satisfied if, for example, $f$ is continuous at $t_0$ and
$f(t_0) > 0$.

Note that $u_n$ can be defined as any unit vector maximizing the $(1-\alpha)$-quantile
of the set $\{u'X_1, \ldots, u'X_n\}$ and that $t_n$ is the value of this maximal
quantile. We shall prove that the asymptotic distribution of $t_n$ is the same
as that of the empirical $(1-\alpha)$-quantile of the variable $u_0'X$. More precisely,
the following result holds.


Proposition 3 Assume that $F \in \bigcap_{|\beta - \alpha| < \varepsilon} M_\beta(u_0)$ for some $\varepsilon > 0$ and
that the density $p_0$ of $u_0'X$ is positive and continuous at $t_0$. Then

\[ \sqrt{n}\,(t_n - t_0) \xrightarrow{L} N\left(0,\; \frac{\alpha(1-\alpha)}{p_0(t_0)^2}\right). \]

Now turn to the behavior of $u_n$. For the next result, we assume that $F \in
M_\alpha(u_0)$ and that $F$ has a rotationally symmetric density of the form $f(u_0'x)$
for some $f$ defined on $[-1,1]$. Moreover, we only consider the cases where
$p \ge 3$ and $\alpha < 1/2$. The case $p = 2$ amounts to the one-dimensional context
of the shorth, which is treated in Kim and Pollard (1990). The restriction
on $\alpha$ ensures $t_0 > 0$, and greatly simplifies the covariance structure of the
limit process.

Denote by $I_q$ the $q$-dimensional identity matrix and by $B_{p-1}$ the unit
closed ball $\{u^* \in \mathbb{R}^{p-1} ; |u^*| \le 1\}$. For a measurable subset $E \subset S^q$,
Area($E$) represents the $q$-dimensional volume of $E$. Given a function $h$
and a distribution $P$, write $Ph$ for the expectation of $h$ under $P$. Let $Z(x)$
be a Gaussian process indexed by $\mathbb{R}^{p-1}$ with continuous sample paths, zero
expectation and covariance kernel

\[ \Gamma(x,y) = (p-2)^{-1}\, \mathrm{Area}(S^{p-3})\, f(t_0)\, (1 - t_0^2)^{(p-2)/2}\, (|x| + |y| - |x - y|). \]


Proposition 4 Assume that f is twice differentiable on (—1,1) and that 
f'(to) > 0. Then 
L 


where Lmaz is the almost surely unique vector maximizing Z(x) + ty Ur, 
with 


U=- - Area(S?™) f'(to)(1 — ad 


Corollary 1 Under the hypotheses of Proposition 4,

\[ 2 n^{2/3} (1 - u_0' u_n) \xrightarrow{L} |x_{max}|^2. \]


The two different rates of convergence, cube-root for the direction and 
square-root for the concentration, parallel those of the center and length of 
the shorth. The minimal caps can be extended to the excess-mass approach 
by finding the cap that maximizes 


F(u, t) — aArea(C (u, t)). 


In this case, both the direction and opening of the excess-mass cap would 
have cube-root rates of convergence. Properties of the excess-mass cap are 
not addressed here. 


4 Proofs 


Proof of Proposition 2 


i) The class of sets $\mathcal{C} = \{C(u,t) ; u \in S, -1 \le t \le 1\}$ is a
Vapnik-Červonenkis class and thus (e.g. Pollard, 1984)

\[ \gamma_n = \sup_{u,t} |F_n(u,t) - F(u,t)| \xrightarrow{a.s.} 0. \]

It follows that for each $\delta > 0$, $F_n(u_0, t_0 - \delta) > \alpha$ eventually, almost
surely. This implies that, for $n$ large enough, $t_n \ge t_0 - \delta$ except on a set
of measure zero. Moreover, the almost sure convergence of $\gamma_n$ to 0 implies
that

\[ \sup_u |F_n(u, t_0 + \delta) - F(u, t_0 + \delta)| \xrightarrow{a.s.} 0, \]

which in turn gives the almost sure upper bound $t_n \le t_0 + \delta$ for $n$ large
enough.


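ii) Note first that for each $\eta > 0$ there exists $\varepsilon > 0$ such that

\[ \sup_{\|u - u_0\| \ge \eta} F(u, t_0) = \alpha - \varepsilon. \]

It then follows from continuity of $F(\cdot,\cdot)$ that there exists $\delta > 0$ such that

\[ \sup_{\|u - u_0\| \ge \eta,\; |t - t_0| \le \delta} F(u,t) < \alpha - \varepsilon/2. \]

Consequently, from i),

\[ \sup_{\|u - u_0\| \ge \eta} F(u, t_n) < \alpha - \varepsilon/2 \]

eventually, almost surely. The latter inequality and $\gamma_n \to 0$ almost surely imply that
$\|u_n - u_0\| < \eta$ eventually, almost surely. $\Box$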


Proof of Proposition 3 


The idea is to wedge $t_n$ between two random variables having the same
asymptotic behavior. Let $s_n$ be the $(1-\alpha)$ sample quantile of $\{u_0'X_1, \ldots,
u_0'X_n\}$. Then by definition

\[ F_n(u_0, s_n) = \alpha + O_p(1/n) \]

and it follows from the minimality of $C(u_n, t_n)$ that $s_n \le t_n$.
The hypotheses of the proposition entail (e.g. see Serfling, 1980) that

\[ \sqrt{n}\,(s_n - t_0) \xrightarrow{L} N\left(0,\; \frac{\alpha(1-\alpha)}{p_0(t_0)^2}\right). \]

Now we find an upper bound for $\sqrt{n}(t_n - t_0)$ with the same asymptotic
distribution as $\sqrt{n}(s_n - t_0)$. We have

\[
\begin{aligned}
\alpha + O_p(1/n) &= F_n(u_n, t_n) \\
&= F(u_n, t_n) + (F_n - F)(u_n, t_n) \\
&\le F(u_0, t_n) + (F_n - F)(u_n, t_n) \\
&= \alpha - p_0(t_0)(t_n - t_0) + o_p(t_n - t_0) + (F_n - F)(u_0, t_0) + o_p(n^{-1/2}).
\end{aligned}
\]

The inequality follows from the hypothesis on $F$ and consistency of $t_n$.
It holds on a set $A_n$ of probability tending to one. Note here that the
hypothesis of Proposition 2 is satisfied. The last $o_p$ term follows from
consistency of $u_n$ and $t_n$ and the fact that the process $\sqrt{n}(F_n - F)$ indexed
by the class $\mathcal{C}$ is stochastically equicontinuous (Pollard, 1984, Theorem VII.21).
This sequence of inequalities implies that on $A_n$

\[ \sqrt{n}\,(t_n - t_0) \le \frac{1}{p_0(t_0)}\, \sqrt{n}\,(F_n - F)(u_0, t_0) + o_p(1). \]

The random variable on the right hand side converges weakly to
the desired normal distribution. $\Box$


Proof of Proposition 4 


We may assume without loss of generality that $u_0 = (0, \ldots, 0, 1)'$. For
$u^* \in B_{p-1}$, let $u = u(u^*)$ be the point on $S^{p-1}$ above $u^*$, i.e. $u =
(u^*, \sqrt{1 - |u^*|^2})'$. For $n$ large enough, $u_n$ can be uniquely represented as
$u_n = u(u_n^*)$ for some $u_n^* \in B_{p-1}$. Note that the north pole $u_0$ corresponds
to $u(0)$. Next, define

\[ W(\cdot, u^*, \delta) = C(u(u^*), t_0 + \delta) - C(u_0, t_0 + \delta) \]

which is to be understood as the difference of the indicator functions of the
corresponding sets.

Note that $u_n^*$ is a solution of the maximization $\sup_{u^*} F_n W(\cdot, u^*, t_n - t_0)$.
To prove this proposition we shall use the main theorem of Kim and Pollard
(1990). To do so, we need to establish two results. The first is that $u_n^*$ also
maximizes $F_n W(\cdot, u^*, 0)$ up to a negligible term, i.e.

\[ F_n W(\cdot, u_n^*, 0) \ge \sup_{u^*} F_n W(\cdot, u^*, 0) - o_p(n^{-2/3}). \]

To show this we need the following lemma.


Lemma 1 Define the function M : Bp-1 x [-1,1] — [0,1] by M(u*,t) = 
F(u,t) and let 


1 

WE) = =z Area(s?) F E — BYP”, 

Then we have the expansion 

1 8? 

2 Op to 
1 

+o) lu"? + 0(62) + oflu" l). 


M(u",t +8) = a + Čl M(0,t) ô+ 


2 
i M(0,t) 6 
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Proof: We have M(u*,t) = fotut) f(uoTur)dS(x) where Tu is the rota- 
tion mapping up to u. Note that Tuo = Ip. Taking into account the fact that 
uo is the north-pole, that the p-th row of the matrix Ty is (—u*, ,/1 — Ju*]), 
and using the measure decomposition dS?—! = (1 — s?)\(—3)/2ds @ dsp? 
(see e.g. Watson, 1983), we obtain after some calculations that for each t 
and each j,k € {1,...,p — 1}, 


82 
Vlur=0 M(u*,t) = 0 = ———|,._,M(u* 
0 (u ) Ou* dux u*=0 (u ,t) 
and 
9 ; 
3u? „oM (u st) = y(t). O 
J 


As a consequence of Lemma 1, 


* 1 * * 
FW(.,u*,6) = 5y(to)lu*|’ + o(8) + o(u*|?). 


Use this equation, the fact that $\gamma(t_0) < 0$, and stochastic equicontinuity
of $n^{2/3}(F_n - F)$ (which follows from Kim and Pollard) in order to obtain the
inequality

\[ F_n W(\cdot, u_n^*, 0) \ge \sup_{u^*} F_n W(\cdot, u^*, 0) - o_p(n^{-2/3}). \]

The second result that needs to be established is that the limiting
covariance is, for each $x, y \in \mathbb{R}^{p-1}$,

\[ \Gamma(x,y) = \lim_{\beta \to 0} \beta^{-1}\, F\, W(\cdot, \beta x, 0)\, W(\cdot, \beta y, 0). \]

Let $t > 0$ and, given two vectors $u, v \in S$, consider the set

\[ J_t(u,v) = C(u,t) \setminus (C(u,t) \cap C(v,t)). \]

Write $A_t(\xi)$ for its area, where $\xi$ denotes the angle between $u$ and $v$. To
obtain this result, we need the following lemma, which is of interest in its
own right.

Lemma 2 The value of the right-hand derivative of $A_t$ at $\xi = 0$ is given
by

$$A_t'(0) = (p-2)^{-1}\, \mathrm{Area}(S^{p-3})\, (1 - t^2)^{(p-2)/2}.$$

Proof: We shall compute the right-hand derivative of
$B_t(\xi) = \mathrm{Area}(C(u,t) \cap C(v,t))$. Note that
$A_t'(0) = -B_t'(0)$. The cap $C(u,t)$ can be viewed as a
$(p-1)$-dimensional spherical disc on $S$ with radius $r = \arccos(t)$. Its
boundary is a $(p-2)$-dimensional sphere on $S$ with center $u \in S$ and
(spherical) radius $r$. We shall denote it by $S^{p-2}(u,r)$ and we shall
denote by $S^{p-2}(u,r,\phi)$ a cap on $S^{p-2}(u,r)$ with half-angle $\phi$
(there is no need for our purposes to specify its center on
$S^{p-2}(u,r)$). We have

$$\mathrm{Area}(S^{p-2}(u,r,\phi)) = \sin^{p-2}(r)\, \mathrm{Area}(S^{p-3}) \int_0^{\phi} \sin^{p-3}(\theta)\, d\theta.$$


Now let $S^+$ be the hemisphere of $S$ determined by
$S^{p-2}(u,r) \cap S^{p-2}(v,r)$ which contains $v$. The Riemannian distance
from $u$ to $S^+$ is given by $a = \xi/2$. Denote by $L^{p-1}(a,r)$ the
intersection of $C(u,t)$ and $S^+$. Then
$B_t(\xi) = 2\, \mathrm{Area}(L^{p-1}(a,r))$.

Now, $L^{p-1}(a,r)$ is composed of caps of the form
$S^{p-2}(u,\rho,\phi(\rho))$, where $a \le \rho \le r$ and where
$\phi(\rho)$ is given by the spherical trigonometry formula

$$\tan(\phi(\rho)) = \frac{\sqrt{\cos^2(a) - \cos^2(\rho)}}{\sin(a)\cos(\rho)}.$$
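
As a check on the formula for $\phi(\rho)$, the following short derivation may help. It assumes (our reading of the geometry, not stated explicitly in the text) a right spherical triangle with hypotenuse $\rho$, one leg of length $a$ running from $u$ to the boundary of $S^+$, and $\phi(\rho)$ the angle at $u$ between that leg and the hypotenuse. Napier's rule for such a triangle gives $\cos\phi = \tan a / \tan\rho$, from which the tangent follows:

```latex
\cos\phi = \frac{\tan a}{\tan\rho}
\quad\Longrightarrow\quad
\tan\phi
  = \frac{\sqrt{\tan^2\rho - \tan^2 a}}{\tan a}
  = \frac{\sqrt{\sin^2\rho\,\cos^2 a - \sin^2 a\,\cos^2\rho}}{\sin a\,\cos\rho}
  = \frac{\sqrt{\cos^2 a - \cos^2\rho}}{\sin a\,\cos\rho}.
```

The last step uses $\sin^2\rho\,\cos^2 a - \sin^2 a\,\cos^2\rho = (1-\cos^2\rho)\cos^2 a - (1-\cos^2 a)\cos^2\rho = \cos^2 a - \cos^2\rho$.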


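It follows that

$$B_t(\xi) = 2\, \mathrm{Area}(S^{p-3}) \int_a^r \sin^{p-2}(\rho) \Big( \int_0^{\phi(\rho)} \sin^{p-3}(\theta)\, d\theta \Big)\, d\rho$$

and we obtain after some computations that

$$B_t'(0) = -\frac{1}{p-2}\, \mathrm{Area}(S^{p-3})\, \sin^{p-2}(r). \qquad \Box$$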


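Now, for $\beta$ small enough, denote by $x_\beta$, $y_\beta$ the vectors on
$S^{p-1}$ which are above $\beta x$, $\beta y$. Then, from continuity of $f$
and Lemma 2, we obtain

$$\lim_{\beta \to 0} \beta^{-1}\, F\, W(\cdot, \beta x, 0)\, W(\cdot, \beta y, 0)$$
$$= \lim_{\beta \to 0} \beta^{-1} \big( F(J_{t_0}(x_\beta, u_0)) + F(J_{t_0}(u_0, y_\beta)) - F(J_{t_0}(x_\beta, y_\beta)) \big)$$
$$= f(t_0)\, A_{t_0}'(0)\, \lim_{\beta \to 0} \beta^{-1} [\xi(x_\beta, u_0) + \xi(u_0, y_\beta) - \xi(x_\beta, y_\beta)]$$
$$= \Gamma(x, y). \qquad \Box$$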
Abstract: We describe a general approach to finding Least Absolute
Deviation estimates of two-way and three-way overlapping clustering models
called ADCLUS and INDCLUS. The suggested approach uses a combinatorial
optimization procedure that takes advantage of a separability property of
the LAD loss function for fitting these models. Our approach helps make the
solutions robust in the presence of extreme outliers in the data.
1 Introduction


Shepard and Arabie (1979) first introduced an overlapping clustering model
for classification based on similarity data called ADCLUS (for additive
clustering). Subsequently, Arabie and Carroll (1980) provided a mathematical
programming approach for fitting this model. Carroll and Arabie (1983)
proposed an individual differences generalization of this model, which they
called INDCLUS (for individual differences clustering), and also devised a
procedure for fitting this three-way generalization, thus providing a
methodology for three-way overlapping clustering. These algorithms optimize
a least squares, or L2-norm based, loss function. The theoretical
significance and conceptual elegance of overlapping clustering can be seen
in applications in many different substantive domains, as illustrated, for
example, in Arabie, Carroll, DeSarbo, and Wind (1981) and Srivastava,
Alpert, and Shocker (1984). In this paper, we present a procedure for
fitting these models via an L1-norm, which we call LADCLUS (for Least
Absolute Deviation clustering). This procedure is a special case of a
general approach (Carroll and Chaturvedi, 1995) for fitting a general
multilinear model including both discrete and continuous parameters via L1-
or L2-norms, or other Lp-norms. Our procedure is computationally
significantly faster and can handle much larger data sets than
Lakshmi-Ratan's (1985) L1-norm based approach for fitting the two-way
overlapping clustering problem. It is computationally as simple and
tractable as a procedure proposed by Chaturvedi and Carroll (1994), which
provides a more efficient algorithm than the earlier algorithms for fitting
these models via an L2-norm. The principal benefit of fitting these
overlapping clustering models via an L1-norm is to make the estimation of
model parameters robust to extreme outliers in the data. Fitting models via
an L1-norm tends to reduce the effects of extreme outliers, as compared to
fitting via an L2 (OLS) norm (Hampel, Ronchetti, Rousseeuw and Stahel, 1986;
Kaufman and Rousseeuw, 1990). The increased robustness of the L1-norm
procedure relative to the L2-norm procedure is analogous to the illustration
in Rousseeuw and Leroy (1987, pp. 10-11) in the context of linear
regression, wherein the L1-norm estimate is shown to be more robust to
extreme values of the dependent variable. The derived solutions would be
more robust to extreme values in the data (analogous to the dependent
variable in the regression case).


2 The INDCLUS model 


Assume that N objects are being clustered into R overlapping clusters.
Then, the INDCLUS model (Carroll, 1975; Carroll and Arabie, 1983) is
written as:

$$S_k = P W_k P' + C_k + \text{error}, \qquad (1)$$

where:

$S_k$ is an (N x N) similarity matrix for the kth subject (or other source
of data); k = 1,...,K,

$W_k$ is an (R x R) diagonal matrix of weights for the kth subject (or
other source of data); k = 1,...,K,

$P$ is an (N x R) binary indicator matrix defining the possibly overlapping
clusters, and

$C_k$ is an (N x N) matrix, all of whose entries are $c_k$, which can be
thought of as the weight for a universal cluster denoted by an N x 1 unit
vector 1, all of whose components are 1.
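
To make the structure of model (1) concrete, the sketch below builds the fitted part $P W_k P' + C_k$ for a small made-up example. The function name `indclus_fitted` and the toy values are our own illustrative assumptions; they are not data or code from the paper.

```python
import numpy as np

def indclus_fitted(P, weights, consts):
    """Return the fitted matrices P W_k P' + C_k, one per data source k.

    P       : (N x R) binary cluster-membership matrix
    weights : K sequences of R non-negative cluster weights (diagonal of W_k)
    consts  : K additive constants c_k (weight of the universal cluster)
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    fitted = []
    for w_k, c_k in zip(weights, consts):
        W_k = np.diag(w_k)              # (R x R) diagonal weight matrix
        C_k = np.full((n, n), c_k)      # every entry equal to c_k
        fitted.append(P @ W_k @ P.T + C_k)
    return fitted

# Hypothetical toy example: N = 4 objects, R = 2 overlapping clusters, K = 1.
P = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 0]])
S_hat = indclus_fitted(P, weights=[[0.5, 0.3]], consts=[0.1])[0]
```

Objects 1 and 2 share only the first cluster, so their fitted similarity is 0.5 + 0.1; object 4 belongs to no cluster, so all of its fitted similarities equal the constant 0.1.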


The diagonal entries in the (N x N) matrix $S_k$ are usually not defined.
The estimation problem is to determine the ordinary least squares (OLS)
estimates of parameters $P$, $W_k$ and $C_k$. The diagonal elements of the
weight matrices $W_k$ must be non-negative and elements of $P$ must be
constrained to be either 0 or 1. The ADCLUS model is the special case of the
INDCLUS model in which K = 1. The ADCLUS procedure proposed by Shepard and
Arabie (1979) combines a combinatorial algorithm with iterative estimation
of the weights. The MAPCLUS procedure proposed by Arabie and Carroll (1980)
fits the ADCLUS model via a penalty function based mathematical programming
technique embedded in an overall alternating least squares procedure. The
INDCLUS method proposed by Carroll and Arabie (1983) generalizes the MAPCLUS
approach to the three-way INDCLUS model. These are all OLS procedures (using
an L2-norm based fit measure). Various other techniques have been developed
for fitting these models, such as the Maximum Likelihood approach of Hojo
(1983), the Qualitative Factor Analysis (QFA) procedure of Mirkin (1987),
and the OLS procedure called SINDCLUS, of Chaturvedi and Carroll (1992,
1994). We now describe the LADCLUS algorithm for fitting these models via an
L1-norm.


3 The LADCLUS algorithm 


The LADCLUS procedure uses a separability property of the LAD loss
function for fitting the INDCLUS model defined in (1) via an alternating
estimation procedure discussed below. This separability property has also
been used in the SINDCLUS procedure (Chaturvedi and Carroll, 1994).
While Mirkin (1990) also uses a one-cluster-at-a-time approach to
estimating a bilinear model, his approach utilizes a different algorithm
for estimating the parameters for a cluster. The two main differences
between the approach presented in this paper and Mirkin's (1990) approach
are: (a) Mirkin does not use the separability property, but a different
approach to estimating the parameters for any given cluster, and (b) his
approach does not yield overall L1-norm based estimates for the model
parameters, since he does not iterate over clusters (as we do) in order to
obtain a globally optimal solution. This separability property can best be
stated in the form of two procedures: the elementary discrete and
elementary continuous LAD procedures. These are illustrated below.
A. The elementary discrete LAD procedure


Consider the following illustrative problem. Let

$$L = \begin{pmatrix} 7 & 5 & 3 & 9 \\ 8 & 6 & 5 & 1 \\ 9 & 4 & 2 & 7 \\ 5 & 3 & 4 & 6 \end{pmatrix}, \qquad x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}, \qquad r' = (1 \;\; 4 \;\; 9 \;\; 3).$$

The estimation problem is to find the least absolute deviation (LAD)
estimate of $x$, where $L = x r' + \text{error}$ and $x$ is constrained to be
binary (0 or 1). That is

$$\begin{pmatrix} 7 & 5 & 3 & 9 \\ 8 & 6 & 5 & 1 \\ 9 & 4 & 2 & 7 \\ 5 & 3 & 4 & 6 \end{pmatrix} \approx \begin{pmatrix} 1x_1 & 4x_1 & 9x_1 & 3x_1 \\ 1x_2 & 4x_2 & 9x_2 & 3x_2 \\ 1x_3 & 4x_3 & 9x_3 & 3x_3 \\ 1x_4 & 4x_4 & 9x_4 & 3x_4 \end{pmatrix}.$$

If we let

$f_1 = |7 - 1x_1| + |5 - 4x_1| + |3 - 9x_1| + |9 - 3x_1|$,
$f_2 = |8 - 1x_2| + |6 - 4x_2| + |5 - 9x_2| + |1 - 3x_2|$,
$f_3 = |9 - 1x_3| + |4 - 4x_3| + |2 - 9x_3| + |7 - 3x_3|$, and
$f_4 = |5 - 1x_4| + |3 - 4x_4| + |4 - 9x_4| + |6 - 3x_4|$,

then the sum of absolute errors is given by

$$F = f_1 + f_2 + f_3 + f_4.$$

Note that $f_1$ is a function only of $x_1$; $f_2$ is a function only of
$x_2$; $f_3$ is a function only of $x_3$; and $f_4$ is a function only of
$x_4$. Thus, $F$ is separable in $x_1$, $x_2$, $x_3$, and $x_4$. To minimize
$F$, one can separately minimize $f_1$ w.r.t. $x_1$, $f_2$ w.r.t. $x_2$,
$f_3$ w.r.t. $x_3$, and $f_4$ w.r.t. $x_4$. To minimize, say, $f_1$ w.r.t.
$x_1$, one can easily evaluate $f_1$ at $x_1 = 1$ and $x_1 = 0$. The $x_1$
yielding the minimum of these two possible values is then chosen. Thus, for
I (0-1) variables, only 2I function evaluations and comparisons are needed,
as compared to $2^I$ evaluations and comparisons for explicit enumeration.
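
The separable search can be sketched in a few lines. The following is a minimal illustration under our own naming: the function `elementary_discrete_lad` and the tie-breaking rule (ties go to 0) are assumptions, not part of the published procedure. It applies the 2I evaluations to the example above.

```python
import numpy as np

def elementary_discrete_lad(L, r):
    """Row-wise binary LAD fit of L ~ x r' with each x_i in {0, 1}.

    Each row contributes its own term f_i(x_i), so f_i is evaluated at
    x_i = 0 and x_i = 1 and the smaller value wins.
    """
    L = np.asarray(L, dtype=float)
    r = np.asarray(r, dtype=float)
    f_at_0 = np.abs(L).sum(axis=1)        # f_i(0) = sum_j |l_ij|
    f_at_1 = np.abs(L - r).sum(axis=1)    # f_i(1) = sum_j |l_ij - r_j|
    x = (f_at_1 < f_at_0).astype(int)     # ties resolved toward 0 (arbitrary)
    return x, np.minimum(f_at_0, f_at_1)

L = [[7, 5, 3, 9],
     [8, 6, 5, 1],
     [9, 4, 2, 7],
     [5, 3, 4, 6]]
r = [1, 4, 9, 3]
x_hat, f_min = elementary_discrete_lad(L, r)
print(x_hat, f_min, f_min.sum())
```

For this example every $f_i$ is smaller at $x_i = 1$ than at $x_i = 0$ (for instance $f_1(1) = 19$ against $f_1(0) = 24$), so the binary LAD estimate is $x = (1,1,1,1)'$.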


B. The elementary continuous LAD procedure 


Again, consider the illustrative problem given in the elementary discrete
LAD procedure, now with the aim of determining $x$ where $x$ is real. As in
the case of the elementary discrete LAD procedure, it can be shown that $F$
is separable in $x_1$, $x_2$, $x_3$, and $x_4$. Thus, to minimize $F$, one
can separately minimize $f_1$ w.r.t. $x_1$, $f_2$ w.r.t. $x_2$, $f_3$ w.r.t.
$x_3$, and $f_4$ w.r.t. $x_4$. The minimization of, say, $f_1$ w.r.t. $x_1$
is equivalent to performing simple L1 regression with a single independent
variable. While the general multivariate L1-regression problem can be
formulated and solved as a constrained linear programming problem
(Rousseeuw and Leroy, 1987, p. 146), in our case, since there is only one
independent variable, the simplex solutions can easily be determined by
evaluating the functions $f_1$ to $f_4$ at the respective corner points. To
illustrate, in order to minimize, say, $f_1$ w.r.t. $x_1$, simply evaluate
$f_1$ at the four corner points given by $x_1$ = 7/1, 5/4, 3/9 and 9/3. The
value of $x_1$ yielding the minimum is chosen as the optimum LAD estimate of
$x_1$. In this specific case, $f_1$ is a minimum at $x_1$ = 3/9. More
generally, to minimize the LAD criterion as a function of a single variable
$x_i$, for the component $f_i$ of the overall loss function
$F = \sum_i f_i$, where

$$f_i = \sum_{j=1}^{J} | l_{ij} - x_i r_j |$$

is a function only of $x_i$, we simply evaluate $f_i$ at the J values

$$x_i^{(j)} = \frac{l_{ij}}{r_j}, \qquad j = 1, \dots, J,$$

and then choose $\hat{x}_i$ to be the $x_i^{(j)}$ minimizing

$$f_i(x_i^{(j)}) = \sum_{j'=1}^{J} | l_{ij'} - x_i^{(j)} r_{j'} |.$$

This can easily be shown to provide the $x_i$ minimizing $f_i$, thus
completing the elementary continuous LAD procedure.
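
The corner-point search just described can be sketched as follows. This is an illustrative implementation under our own naming (`elementary_continuous_lad` is hypothetical); it assumes at least one $r_j$ is nonzero and simply skips breakpoints where $r_j = 0$, a case the text does not discuss.

```python
import numpy as np

def elementary_continuous_lad(L, r):
    """Row-wise real-valued LAD fit of L ~ x r'.

    For each row i, f_i is evaluated at the breakpoints l_ij / r_j and the
    breakpoint with the smallest value of f_i is returned.
    """
    L = np.asarray(L, dtype=float)
    r = np.asarray(r, dtype=float)
    nz = r != 0                                   # breakpoints need r_j != 0
    x = np.empty(L.shape[0])
    f_min = np.empty(L.shape[0])
    for i, row in enumerate(L):
        candidates = row[nz] / r[nz]              # the values x_i^(j)
        # losses[c] = sum_j |l_ij - candidates[c] * r_j|
        losses = np.abs(row[None, :] - candidates[:, None] * r[None, :]).sum(axis=1)
        best = np.argmin(losses)
        x[i], f_min[i] = candidates[best], losses[best]
    return x, f_min

L = [[7, 5, 3, 9],
     [8, 6, 5, 1],
     [9, 4, 2, 7],
     [5, 3, 4, 6]]
r = [1, 4, 9, 3]
x_hat, f_min = elementary_continuous_lad(L, r)
print(x_hat[0], f_min[0])
```

For the first row the four candidates 7, 5/4, 3/9 and 3 give losses 95, 19.25, 55/3 and 35, so the minimum is attained at $x_1 = 3/9$, in agreement with the text.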


The LADCLUS procedure utilizes an alternating least absolute deviation
(L1-norm) procedure in fitting the ADCLUS/INDCLUS models. The algorithm
converges to at least a local optimum, as the objective function value
decreases (or does not increase) at each stage of the algorithm, and there
is a lower bound to the objective function.

The estimation problem in LADCLUS is to find LAD estimates of $P$, $W_k$,
and $C_k$ (k = 1,..., K) in the equation:

$$S_k = P W_k P' + C_k + \text{error}, \qquad (2)$$

where $P$, $W_k$, and $C_k$ are as defined in equation (1). If $S_k$ were
considered to be a nonsymmetric (N x N) matrix, then the above equation can
be generalized to

$$S_k = P W_k Q' + C_k + \text{error}, \qquad (3)$$

where $Q$ is an (N x R) matrix, not necessarily the same as $P$. In the
symmetric case of INDCLUS, $P = Q$.

We use (3) to estimate the parameters for the symmetric case without
imposing the constraint $P = Q$. Thus, we will first define the estimation
problem for the nonsymmetric case, and then specialize it to the symmetric
case. This is the same overall strategy as followed by Chaturvedi and
Carroll (1994) in their SINDCLUS approach to OLS estimation of
ADCLUS/INDCLUS. Defining the following symbols:

$p_r$ = (N x 1) binary vector for the rth cluster,

$w_r$ = (K x 1) vector of the weights for the rth cluster, with kth element
$w_{kr}$,

$q_r$ = (N x 1) binary vector for the rth cluster (not necessarily = $p_r$),

$P_{(-r)}$ = (N x R) binary matrix including the universal cluster 1 but
excluding the rth cluster,

$W_{k(-r)}$ = (R x R) weight matrix for the kth subject or other source of
data, including the weight for the universal cluster but excluding the
weight for the rth cluster, and

$Q_{(-r)}$ = (N x R) matrix including the universal cluster 1 but excluding
the rth cluster,

we can rewrite (3) as

$$S_k = p_r w_{kr} q_r' + P_{(-r)} W_{k(-r)} Q_{(-r)}' + \text{error}.$$

If we have estimates of all but the rth cluster, then we can define
$\tilde{S}_k$ as

$$\tilde{S}_k = S_k - P_{(-r)} W_{k(-r)} Q_{(-r)}' \qquad (4)$$

to get

$$\tilde{S}_k = p_r w_{kr} q_r' + \text{error}, \qquad (5)$$

that is,

$$\tilde{S}_k = f(\text{parameters for cluster } r) + \text{error}. \qquad (6)$$

If we have the K residual matrices $\tilde{S}_k$ of equation (4), each of
order (N x N), we can use a procedure similar to the CANDECOMP based
algorithm called the INDSCAL method for fitting the INDSCAL model formulated
by Carroll and Chang (1970). We call this general procedure CANDCLUS
(Carroll and Chaturvedi, 1995), which stands for CANonical Decomposition
CLUStering. Let us assume that we have the parameter estimates for all
clusters except the rth cluster. Let $T_1$ be a (K x $N^2$) matrix whose kth
row has all the $N^2$ terms of the N x N matrix $\tilde{S}_k$, and let $T_2$
and $T_3$ be (N x KN) matrices. $T_2$ is the supermatrix that has the jth
rows of the matrices $\tilde{S}_1, \dots, \tilde{S}_K$ in its jth row. Thus:

$$T_2 = [\tilde{S}_1 \,|\, \tilde{S}_2 \,|\, \dots \,|\, \tilde{S}_K].$$

Similarly,

$$T_3 = [\tilde{S}_1' \,|\, \tilde{S}_2' \,|\, \dots \,|\, \tilde{S}_K'].$$
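
The three arrangements of the residual matrices can be formed as below. This is a sketch with a hypothetical helper name (`build_supermatrices`); the row-major unfolding used for $T_1$ is our assumption, chosen because it matches the element ordering of the Kronecker product of two vectors.

```python
import numpy as np

def build_supermatrices(S_res):
    """Arrange K residual matrices of order N into T1, T2 and T3.

    T1 : (K, N*N), row k is the residual matrix for source k unfolded row by row
    T2 : (N, K*N), the residual matrices placed side by side
    T3 : (N, K*N), the transposed residual matrices placed side by side
    """
    S_res = np.asarray(S_res, dtype=float)
    K, N, _ = S_res.shape
    T1 = S_res.reshape(K, N * N)
    T2 = np.hstack([S_res[k] for k in range(K)])
    T3 = np.hstack([S_res[k].T for k in range(K)])
    return T1, T2, T3
```

With this layout, row i of $T_2$ collects every entry that involves object i as a row object, and row j of $T_3$ collects every entry that involves object j as a column object.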


Assuming that estimates of $p_r$ and $q_r$ are known, the parameters for the
rth cluster are estimated by iterating the following 3 Steps until at least
a local optimum is reached.


• Step 1. Estimating $w_r$ conditionally


Given current estimates, $\hat{p}_r$ and $\hat{q}_r$, of $p_r$ and $q_r$, let
$g_r$ be a vector of $N^2$ elements such that

$$g_r = \hat{p}_r \otimes \hat{q}_r,$$

where $\otimes$ is the Kronecker product. Then, using the elementary
continuous LAD procedure outlined earlier, one can find the LAD estimate
$\hat{w}_r$ in the equation

$$T_1 = w_r g_r' + \text{error}.$$

Non-negativity constraints are imposed easily by simply setting all negative
weights to zero as in Carroll, DeSoete, and Pruzansky (1989).


• Step 2. Estimating $p_r$ conditionally


Given current estimates, $\hat{w}_r$ and $\hat{q}_r$, of $w_r$ and $q_r$, let
$h_r$ be a vector of KN elements such that

$$h_r = \hat{w}_r \otimes \hat{q}_r.$$

Then, by using the elementary discrete LAD procedure in the equation

$$T_2 = p_r h_r' + \text{error},$$

one can find $\hat{p}_r$, the LAD estimate of $p_r$.


• Step 3. Estimating $q_r$ conditionally


Given current estimates, $\hat{w}_r$ and $\hat{p}_r$, of $w_r$ and $p_r$, let
$j_r$ be a vector of KN elements such that

$$j_r = \hat{w}_r \otimes \hat{p}_r.$$

Then, by using the elementary discrete LAD procedure in the equation

$$T_3 = q_r j_r' + \text{error},$$

one can find $\hat{q}_r$, the LAD estimate of $q_r$.
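
Steps 1 to 3 can be put together as a single pass over one cluster. The code below is a sketch under our own assumptions, not the authors' program: the helper names are hypothetical, ties in the discrete step go to 0, undefined diagonals are not handled, and the requirement that a cluster vector must not be all zero is not enforced.

```python
import numpy as np

def discrete_lad(T, h):
    """Binary x minimizing sum |T - x h'|, decided separately for each row."""
    f0 = np.abs(T).sum(axis=1)
    f1 = np.abs(T - h).sum(axis=1)
    return (f1 < f0).astype(float)

def continuous_lad(T, g):
    """Real x minimizing sum |T - x g'| per row, searched over t_j / g_j."""
    nz = g != 0
    if not nz.any():                      # no information about x in g
        return np.zeros(T.shape[0])
    x = np.empty(T.shape[0])
    for i, row in enumerate(T):
        cand = row[nz] / g[nz]
        loss = np.abs(row[None, :] - cand[:, None] * g[None, :]).sum(axis=1)
        x[i] = cand[np.argmin(loss)]
    return x

def update_cluster(S_res, p, q):
    """One pass of Steps 1-3 for one cluster; S_res has shape (K, N, N)."""
    K, N, _ = S_res.shape
    T1 = S_res.reshape(K, N * N)
    T2 = np.hstack(list(S_res))
    T3 = np.hstack([S.T for S in S_res])
    # Step 1: weights given p and q, with negative weights set to zero.
    w = np.maximum(continuous_lad(T1, np.kron(p, q)), 0.0)
    # Step 2: p given the new w and the current q.
    p = discrete_lad(T2, np.kron(w, q))
    # Step 3: q given the new w and the new p.
    q = discrete_lad(T3, np.kron(w, p))
    return p, q, w
```

In a full minor iteration this pass would be repeated until the sum of absolute residuals stops decreasing.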


The LADCLUS algorithm starts off with random binary starting values for the
matrices $P$ and $Q$, and random continuous values in the diagonal matrices
$W_k$. The LADCLUS procedure iterates up to a prespecified maximum number of
major iterations or until the percentage reduction in the fit value is less
than a prespecified criterion. The current default is 0.0001. The current
fit value (M2) and the previous fit value (M1) are updated after each major
iteration. Each major iteration of the LADCLUS algorithm involves
determining at least locally optimal conditional L1-norm estimates of the
parameters for the R + 1 clusters, using the one-cluster-at-a-time strategy.
Thus, each major iteration comprises R + 1 minor iterations, corresponding
to the R + 1 clusters (the R clusters and the universal cluster).

The minor iteration of the LADCLUS procedure corresponding to the rth
cluster involves sequentially finding conditionally optimal parameter
estimates $\hat{w}_r$, $\hat{p}_r$, and $\hat{q}_r$. This is achieved by
first forming supermatrices $T_1$, $T_2$ and $T_3$. The vector $g_r$ is then
formed and $\hat{w}_r$ is estimated as described in Step 1 using the
elementary continuous LAD procedure. This is followed by Steps 2 and 3 of
estimating $\hat{p}_r$ and $\hat{q}_r$ respectively, using the elementary
discrete procedure. The fit value is then computed (F2), and compared to the
previous fit value (F1). This process is repeated until there is no
improvement in fit (i.e. F2 = F1). At this point an (at least locally)
optimal set of parameter estimates has been obtained for the rth cluster,
conditional on the fixed values of the R - 1 other clusters (and the
universal cluster).

It should be noted that in Steps 2 and 3 above, the p and q vectors cannot
be all zero. Thus, each cluster must have at least one object in it. Since
the matrices $S_k$ usually do not have diagonals in the case of the INDCLUS
model, Steps 1, 2 and 3 above need to be modified. In the case of undefined
diagonals, we simply drop the corresponding columns from the $T_1$ matrix
and g vector in Step 1. Similarly, we do not consider the diagonal elements
in Steps 2 and 3 for fitting the INDCLUS model. For estimating the weights
for the universal cluster, where p and q are fixed, we just use Step 1, with
p = q = 1.

While we do not impose the constraint that $P = Q$ in the estimation
procedure, we find empirically that in the symmetric case of INDCLUS, the
matrices $P$ and $Q$ are equal upon convergence at the global optimum. We
conjecture that this will always occur under conditions of unique global
optimality. Also, while the default option in LADCLUS is to estimate the
INDCLUS model by treating the diagonals as missing data, LADCLUS also allows
fitting of the INDCLUS model when diagonals are treated as non-missing. One
major advantage of LADCLUS, in addition to its greater computational
efficiency, is its ability to handle arbitrary patterns of missing data
satisfactorily. (In fact, the default option for diagonals is simply a
special case of handling missing data.) Since, at each stage of the
algorithm, we are conditionally estimating one new "dimension" (e.g., the
$p_r$ or $q_r$ vector in LADCLUS) through the use of the elementary discrete
LAD procedure mentioned earlier, omission of data is accomplished by
omitting the corresponding terms from the corresponding LAD loss function.
(The treatment of missing data as described is a special case of weighted
LAD fitting, with weights of zero for missing observations and one for those
that are present.) The generalization of LADCLUS to weighted LAD is also
straightforward, since each of the three conditional LAD estimation stages
can simply be replaced with an appropriately weighted LAD estimation
procedure.
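
The zero/one weighting described above can be illustrated for the discrete step. The following sketch uses a hypothetical function name and assumes that missing cells may be stored as NaN; entries with zero weight are excluded from the loss altogether.

```python
import numpy as np

def weighted_discrete_lad(T, h, weights):
    """Binary row-wise LAD fit of T ~ x h' with one weight per entry.

    A weight of zero removes the entry from the loss, so the corresponding
    cell of T may hold any placeholder, including NaN.
    """
    T = np.asarray(T, dtype=float)
    weights = np.asarray(weights, dtype=float)
    dev0 = np.where(weights > 0, weights * np.abs(T), 0.0)       # terms of f_i(0)
    dev1 = np.where(weights > 0, weights * np.abs(T - h), 0.0)   # terms of f_i(1)
    return (dev1.sum(axis=1) < dev0.sum(axis=1)).astype(int)

# A single row whose third entry is an extreme value.
T = np.array([[0.0, 0.0, 10.0]])
h = np.array([1.0, 1.0, 5.0])
x_all = weighted_discrete_lad(T, h, np.ones_like(T))              # all entries used
x_masked = weighted_discrete_lad(T, h, np.array([[1.0, 1.0, 0.0]]))  # third entry missing
```

With all entries present the row is assigned to the cluster (loss 7 against 10); once the third entry is treated as missing, the remaining entries favour leaving it out (loss 0 against 2).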

One final comment vis-a-vis LADCLUS is that no special case needs to be
described for the fitting of the ADCLUS model, in which K = 1. ADCLUS is
simply fit as a special case of the INDCLUS model, in which the third way,
for subjects or other sources of data, has only one level.


4 Applications of LADCLUS to some real data 


The LADCLUS procedure was applied to the Kinship data of Rosenberg 
and Kim published in Arabie, Carroll, and DeSarbo (1987). The application 
of SINDCLUS to this data set is described in detail in Chaturvedi and 
Carroll (1994). We present this application of LADCLUS to compare the 
solutions derived via LADCLUS and SINDCLUS. 

The fifteen most commonly used kinship terms (Aunt, Brother, Cousin,
Daughter, Father, Granddaughter, Grandfather, Grandmother, Grandson,
Mother, Nephew, Niece, Sister, Son, and Uncle) were printed on slips of
paper for use in a sorting task by Rosenberg and Kim (1975). Eighty-five
male and eighty-five female subjects were run in a condition where subjects
gave (only) a single sort of the fifteen terms. A different group of
subjects (eighty-five males and eighty-five females) were told that, after
making their first sorts of the terms, they should give additional
subjective partitioning(s) of these stimuli using "a different basis of
meaning each time". Rosenberg and Kim (1975) used only the data from the
first and second sortings for this group of subjects. Thus, we have six
conditions which will correspond to our subjects: females' single-sort,
males' single-sort, females' first-sort, males' first-sort, females'
second-sort, and males' second-sort. Again note that the "subjects" (or
other sources of data) in the first two conditions were distinct from those
in the last four conditions.

Since the subjects' partitions of the stimuli comprise nominal scale data
that do not immediately assume the form of a proximity matrix, some
pre-processing is necessary to obtain such a matrix. If we form a stimuli
by stimuli co-occurrence matrix for each experimental condition, with the
(i, j)th entry derived as the number of subjects who placed stimuli i and j
in the same group, and subtract that entry from the total number of
subjects contributing to the matrix, then we have what is called the
S-measure (Arabie, Carroll, and DeSarbo, 1987). As in Arabie, Carroll, and
DeSarbo (1987), the six matrices constructed using the S-measure were
analyzed using LADCLUS via a matrix unconditional approach. A five cluster
solution explaining 38.75 percent of absolute deviation in the data (around
the grand median) was extracted. The optimal clusters derived using the
LADCLUS procedure are identical to the clusters derived by Arabie, Carroll,
and DeSarbo (1987). The five cluster solution is presented in Table 1, while
the importance weights derived via LADCLUS and INDCLUS are presented in
Tables 2 and 3.
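
The pre-processing step described above (co-occurrence counts subtracted from the number of subjects) can be sketched as follows. The function name and the tiny example with three subjects and three stimuli are ours and purely illustrative; they are not the Rosenberg and Kim data.

```python
import numpy as np

def cooccurrence_dissimilarity(sorts):
    """Turn sorting data into a stimuli-by-stimuli matrix.

    sorts[s, i] is the label of the pile into which subject s put stimulus i.
    Entry (i, j) of the result is the number of subjects minus the number of
    subjects who placed stimuli i and j in the same pile.
    """
    sorts = np.asarray(sorts)
    same_pile = sorts[:, :, None] == sorts[:, None, :]   # (subjects, n, n)
    co_occurrence = same_pile.sum(axis=0)
    return sorts.shape[0] - co_occurrence

# Hypothetical sorts by three subjects of three stimuli.
sorts = [[0, 0, 1],
         [0, 1, 1],
         [0, 0, 0]]
D = cooccurrence_dissimilarity(sorts)
```

Here stimuli 1 and 2 share a pile for two of the three subjects, so the corresponding entry is 3 - 2 = 1, and every diagonal entry is 0.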

The clusters are easily interpreted. In the order listed, the first two are 
sex-defined, the third is the collateral relatives, the fourth is the nuclear 
family, while the fifth consists of grandparents and grandchildren. 


5 Conclusions 


The LADCLUS procedure introduces least absolute deviations as an objective
function for the ADCLUS/INDCLUS models. The enhanced robustness of LADCLUS
to the presence of extreme outliers over existing least squares procedures
was demonstrated in an extensive Monte-Carlo simulation. The LADCLUS
procedure can be extended to fit other hybrid multivariate models that
entail both continuous and discrete parameters, where the discrete
parameters can take on any discrete values. We hope that the LADCLUS
procedure further enhances the applicability of the ADCLUS and INDCLUS
models. While in the current version LADCLUS can yield solutions that are
only locally optimal, we are investigating approaches that ameliorate this
problem and yield globally optimal solutions, even for large data sets.


Table 1: LADCLUS & INDCLUS solutions for Rosenberg and Kim data.

Cluster  Items in Cluster                          Interpretation
a        Brother, father, grandfather, grandson,   Male relatives excluding cousins
         nephew, son, uncle
b        Aunt, daughter, granddaughter,            Female relatives excluding cousins
         grandmother, mother, niece, sister
c        Aunt, cousin, nephew, niece, uncle        Collateral relatives
d        Brother, daughter, father, mother,        Nuclear family
         sister, son
e        Granddaughter, grandfather,               Grandparents and grandchildren
         grandmother, grandson
f        All objects                               Universal cluster


Table 2: LADCLUS weights for the Rosenberg and Kim data.

Subject               a      b      c      d      e      Universal
Females' single-sort  0.11   0.11   0.44   0.29   0.48   0.04
Males' single-sort    0.20   0.19   0.29   0.28   0.28   0.05
Females' first-sort   0.57   0.57   0.27   0.17   0.21   0.05
Females' second-sort  0.25   0.27   0.33   0.25   0.32   0.09
Males' first-sort     0.33   0.33   0.28   0.17   0.29   0.08
Males' second-sort    0.31   0.32   0.11   0.15   0.13   0.16
Table 3: INDCLUS weights for the Rosenberg and Kim data.

Subject               a      b      c      d      e      Universal
Females' single-sort  0.052  0.049  0.552  0.478  0.626  0.055
Males' single-sort    0.143  0.146  0.397  0.372  0.449  0.075
Females' first-sort   0.551  0.554  0.283  0.206  0.251  0.132
Females' second-sort  0.241  0.246  0.373  0.322  0.385  0.158
Males' first-sort     0.299  0.291  0.340  0.241  0.395  0.158
Males' second-sort    0.295  0.306  0.237  0.219  0.253  0.207
Abstract: The classification task of hierarchical clustering can be
characterized as one of constructing for an object set S a sequence of
successively less-refined partitions that attempts to represent the pattern
of entries in a given symmetric proximity matrix defined between the
objects. We discuss this process of constructing a partition hierarchy by
the fitting through an Lp-norm (for p = 1, 2, or ∞) of a second symmetric
matrix whose entries represent what is called an ultrametric and which can
be used to induce a partition hierarchy. A dynamic programming strategy,
and a heuristic extension for larger object sets, is suggested as the
computational mechanism for carrying out the procedure of combinatorial
search for the ultrametric that is the best-fitting according to the chosen
Lp-norm. A numerical example is used to illustrate the complete fitting
process that relies on a proximity matrix provided. A final extension is
presented for the construction of best-fitting ultrametrics based on
two-mode proximity data defined between distinct object sets.


Key words: Ultrametric, Lp-norm, hierarchical clustering, dynamic
programming, partitioning.
1 Introduction


One of the most studied data analysis topics in the field of classification
is that of constructing a hierarchical clustering for an object set,
$S = \{O_1,\dots,O_n\}$, based on some given n x n symmetric proximity
matrix $P = \{p_{ij}\}$; an entry $p_{ij}$ ($= p_{ji} \ge 0$, and
$p_{ii} = 0$) is assumed to represent the dissimilarity of the objects $O_i$
and $O_j$, where larger values correspond to the more dissimilar objects. A
hierarchical clustering of S can be represented by a sequence of partitions,
$\mathcal{P}_1, \mathcal{P}_2, \dots, \mathcal{P}_T$, where $\mathcal{P}_1$
is the (disjoint) partition in which each object forms its separate class,
$\mathcal{P}_T$ is the (conjoint) partition containing all objects in S
within a single class, and $\mathcal{P}_t$ is constructed by uniting two or
more classes in $\mathcal{P}_{t-1}$. (Most commonly, only one pair of
classes will be united in $\mathcal{P}_{t-1}$, so that T = n and
$\mathcal{P}_t$ therefore includes n - t + 1 classes.) The task of
hierarchical clustering is typically carried out by a greedy optimization
strategy, which begins with $\mathcal{P}_1$ and successively identifies
$\mathcal{P}_t$ from $\mathcal{P}_{t-1}$, for $t \ge 2$, by minimizing some
chosen measure of proximity between the subsets that could be united to form
a new class in $\mathcal{P}_t$. Most commercially available statistical
software packages (e.g., SYSTAT, SPSS, and SAS) implement their routines for
hierarchical clustering in this manner and with various choices for how the
proximity between subsets might be defined.

The present paper is concerned with this particular problem of constructing
a partition hierarchy that is intended to represent the patterning of
relationships present in the proximity matrix P, but will do so indirectly
by fitting a second matrix to $P = \{p_{ij}\}$, denoted by
$U = \{u_{ij}\}$, minimizing an Lp-norm (for one of the usual values chosen
for p of 1, 2, or ∞). The entries in the fitted matrix U will satisfy a
collection of linear inequality/equality constraints, characterizing what is
called an ultrametric, that in turn can be used to retrieve a specific
partition hierarchy for the object set S. The fitting task itself will be
carried out through a recursive optimization strategy based on dynamic
programming which for small object sets can provide globally optimal
solutions. Later sections of the paper discuss the heuristic use of the same
dynamic programming strategy for dealing with larger object sets, and an
extension of the hierarchical clustering task for proximity matrices that
only contain dissimilarity values between the objects from two distinct sets
(i.e., two-mode proximity data).
2 Ultrametrics


A concept routinely encountered in formal discussions of hierarchical
clustering is that of an ultrametric, which can be characterized by any
nonnegative n x n symmetric dissimilarity matrix for the objects in S,
denoted generically as $U = \{u_{ij}\}$, where $u_{ij} = 0$ if and only if
i = j and the entries in U satisfy the ultrametric inequality:
$u_{ij} \le \max\{u_{ik}, u_{jk}\}$ for $1 \le i, j, k \le n$. An
alternative characterization of this last inequality would be that for all
distinct object triples, $O_i$, $O_j$, and $O_k$, the largest two
dissimilarities among $u_{ij}$, $u_{ik}$, and $u_{jk}$ are equal and
(therefore) not smaller than the third. Any ultrametric identifies a
specific partition hierarchy, $\mathcal{P}_1,\dots,\mathcal{P}_T$, where
those object pairs defined between subsets united in $\mathcal{P}_{t-1}$ to
form $\mathcal{P}_t$ all have a common ultrametric value that is not smaller
than those for object pairs defined within these same subsets. Thus, the
individual partitions in the sequence can be identified by increasing a
threshold variable from zero and observing that $\mathcal{P}_t$ is
associated with a particular threshold value where all dissimilarities
within a class in $\mathcal{P}_t$ are less than or equal to this threshold
and all dissimilarities between the classes in $\mathcal{P}_t$ are strictly
greater. Conversely, the collection of all ultrametric matrices can be
decomposed into equivalence classes where all members of an equivalence
class induce the same partition hierarchy. If
$\mathcal{P}_1,\dots,\mathcal{P}_T$ denotes the specific partition hierarchy
induced by all members of an equivalence class, we will refer to one
particular member of this class as the base ultrametric defined by
$U^0 = \{u^0_{ij}\}$, where $u^0_{ij} = \min\{t - 1 \mid$ objects $O_i$ and
$O_j$ appear within the same class in partition $\mathcal{P}_t\}$. All
members of an equivalence class can be obtained from the entries for the
base ultrametric by a strictly monotonic function that maps zero to zero.
Moreover, since $U^0$ contains T - 1 distinct positive values, each member
of this equivalence class will also contain T - 1 distinct positive values,
where the (t - 1)st smallest corresponds to partition $\mathcal{P}_t$ in the
hierarchy and is implicitly associated with those object pairs that appear
together for the first time within a subset in $\mathcal{P}_t$.
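
The two notions just defined, the ultrametric inequality and the base ultrametric of a partition hierarchy, can be illustrated with a short sketch. The function names and the three-object hierarchy are our own assumptions; the check covers only the inequality itself and presumes a symmetric matrix with zero diagonal.

```python
import numpy as np

def is_ultrametric(U, tol=1e-12):
    """Check u_ij <= max(u_ik, u_jk) for all i, j, k."""
    U = np.asarray(U, dtype=float)
    # bound[i, j, k] = max(u_ik, u_jk)
    bound = np.maximum(U[:, None, :], U[None, :, :])
    return bool(np.all(U[:, :, None] <= bound + tol))

def base_ultrametric(hierarchy):
    """Base ultrametric of a hierarchy P_1, ..., P_T.

    hierarchy is a list of partitions, each a list of sets of object indices;
    entry (i, j) is t - 1 for the first partition P_t that puts i and j in
    the same class.
    """
    n = sum(len(cls) for cls in hierarchy[0])
    U = np.zeros((n, n))
    filled = np.eye(n, dtype=bool)
    for t, partition in enumerate(hierarchy, start=1):
        for cls in partition:
            for i in cls:
                for j in cls:
                    if not filled[i, j]:
                        U[i, j] = t - 1
                        filled[i, j] = True
    return U

# Hypothetical hierarchy on three objects: {0},{1},{2} -> {0,1},{2} -> {0,1,2}.
hierarchy = [[{0}, {1}, {2}], [{0, 1}, {2}], [{0, 1, 2}]]
U0 = base_ultrametric(hierarchy)
```

For this hierarchy the pair (0, 1) first appears together in the second partition and receives the value 1, while both pairs involving object 2 receive the value 2, so the two largest entries in the single triple are equal, as the ultrametric inequality requires.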

An ultrametric matrix is a convenient device for representing in matrix
form the partition hierarchy it induces, and specifically, the
integer-valued base ultrametric can serve as a direct way for generating the
explicit set of linear inequality/equality constraints that any ultrametric
within an equivalence class must satisfy. Thus, one could find a
best-fitting ultrametric within an equivalence class by fitting
$\{u_{ij}\}$ to the original proximity matrix $\{p_{ij}\}$ through, for
example, an Lp-norm regression strategy that incorporates the linear
inequality/equality constraints implied by the base ultrametric (e.g., those
in Späth, 1991, Chapter 5). It is also possible to use the ultrametric
notion more fundamentally as the basic mechanism for obtaining a partition
hierarchy in the first place. Explicitly, we suggest here the development of
hierarchical clustering methods by directly attempting to find a
best-fitting ultrametric for P by optimizing a loss criterion defined by an
Lp-norm between $\{p_{ij}\}$ and a (to be identified) ultrametric matrix
$\{u_{ij}\}$. This usage of an Lp-norm is more general than what has been
done thus far in the literature; the extant methods that attempt directly to
obtain a best-fitting ultrametric have all adopted a least-squares criterion
and some auxiliary search strategy for locating an appropriate set of
constraints to impose (e.g., see Hartigan, 1967; Carroll and Pruzansky,
1980; De Soete, 1984; Chandon and De Soete, 1984; Hubert and Arabie, 1995).

To be specific, suppose for a given partition hierarchy, $P_1, \ldots, P_n$ (so
$T = n$), we let $C_u^{(t-1)}$ and $C_v^{(t-1)}$ denote the two classes united in
$P_{t-1}$ to form $P_t$, and specify $b_{t-1}$ to be some appropriate aggregate
(or ‘average’) value of the proximities for object pairs between $C_u^{(t-1)}$
and $C_v^{(t-1)}$. Denoting the set of proximities between $C_u^{(t-1)}$ and
$C_v^{(t-1)}$ as
$B_{t-1}(u,v) = \{p_{i'j'} \mid O_{i'} \in C_u^{(t-1)},\ O_{j'} \in C_v^{(t-1)}\}$,
and depending on the $L_p$-norm chosen, this between-subset aggregate value
will be variously defined as the median ($L_1$), the mean ($L_2$), or the
average of the maximum and the minimum proximities ($L_\infty$) in the set
$B_{t-1}(u,v)$. The loss functions based on an $L_p$-norm used to index the
adequacy of a given partition hierarchy in producing an ultrametric fitted
to P are for the

$L_1$-norm:

$$\sum_{t=2}^{n} \ \sum_{O_{i'} \in C_u^{(t-1)},\ O_{j'} \in C_v^{(t-1)}} | p_{i'j'} - b_{t-1} |,$$

where $b_{t-1}$ is the median proximity in the set $B_{t-1}(u,v)$;

$L_2$-norm:

$$\sum_{t=2}^{n} \ \sum_{O_{i'} \in C_u^{(t-1)},\ O_{j'} \in C_v^{(t-1)}} ( p_{i'j'} - b_{t-1} )^2,$$

where $b_{t-1}$ is the mean proximity in the set $B_{t-1}(u,v)$;

$L_\infty$-norm:

$$\sum_{t=2}^{n} \ \max_{O_{i'} \in C_u^{(t-1)},\ O_{j'} \in C_v^{(t-1)}} | p_{i'j'} - b_{t-1} |,$$

where $b_{t-1}$ is the average of the minimum and maximum proximities in the
set $B_{t-1}(u,v)$.
For all three $L_p$-norms, an optimal ultrametric will be one for which
the order constraint on the between-subset aggregate values holds,
$b_1 \le b_2 \le \cdots \le b_{n-1}$, and the norm is minimized. For such an
optimal solution, the between-subset aggregate values, $b_1, \ldots, b_{n-1}$,
define the distinct entries in an (optimal) fitted ultrametric. (It might be
noted that since fewer than $n - 1$ distinct values could be identified if some
of the between-subset aggregate values are tied, the search for an optimal
ultrametric can assume without loss of generality that $T = n$, and for
$t \ge 2$, only two classes are united within $P_{t-1}$ to form $P_t$. Also, as
a technical convenience, we allow the possibility that some of the
between-subset aggregate values may be identically zero when the proximities
for calculating these are all zero. Although not technically an ultrametric
since zero ultrametric values should not correspond to distinct objects, its
structure would still satisfy the central ultrametric inequality for distinct
object triples.)
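
A minimal sketch of how the three loss functions might be evaluated for a given sequence of merges follows. The function names, the `"L1"`/`"L2"`/`"Linf"` labels, and the input format (a symmetric matrix plus an ordered list of class pairs) are assumptions made for illustration. The example uses the three proximities among chimpanzee, gorilla and monkey from Table 1 (40, 26 and 59).

```python
from statistics import mean, median


def aggregate(values, norm):
    """Between-subset aggregate value b for the chosen norm."""
    if norm == "L1":
        return median(values)
    if norm == "L2":
        return mean(values)
    if norm == "Linf":
        return (min(values) + max(values)) / 2
    raise ValueError(f"unknown norm: {norm}")


def hierarchy_loss(p, merges, norm):
    """Return (fitted aggregate values, loss) for an ordered list of merges.

    `p` is a symmetric proximity matrix; `merges` lists the pairs of
    classes (sets of object indices) united at each successive level.
    """
    fitted, loss = [], 0.0
    for class_u, class_v in merges:
        between = [p[i][j] for i in class_u for j in class_v]
        b = aggregate(between, norm)
        deviations = [abs(x - b) for x in between]
        if norm == "L1":
            loss += sum(deviations)
        elif norm == "L2":
            loss += sum(d * d for d in deviations)
        else:
            loss += max(deviations)
        fitted.append(b)
    return fitted, loss


# Objects: 0 = chimpanzee, 1 = gorilla, 2 = monkey (values from Table 1).
p = [[0, 40, 26],
     [40, 0, 59],
     [26, 59, 0]]
merges = [({0}, {2}), ({0, 2}, {1})]
fitted_l1, loss_l1 = hierarchy_loss(p, merges, "L1")
fitted_linf, loss_linf = hierarchy_loss(p, merges, "Linf")
```

Under these assumptions the fitted values are 26 and 49.5 for both norms, with losses of 19.0 ($L_1$) and 9.5 ($L_\infty$), which agree with the contributions reported for class D in Section 4.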


3 A dynamic programming strategy for 
identifying (optimal) ultrametrics 


The optimization task of constructing optimal ultrametrics fitted to a given 
proximity matrix P may be fairly easy to state, but the problem itself is a 
computationally very difficult one to solve. For both the $L_1$- and $L_2$-norm,
for instance, the task has been shown to fall into the class of NP-hard 
problems (see Křivánek and Morávek, 1986; Křivánek, 1986; for a recent 
comprehensive review, see Day, 1996); thus, there is the usual expectation 
that for larger object sets, methods guaranteeing optimality would become 
computationally infeasible to implement. Keeping these computational dif- 
ficulties in mind, along with the eventual necessity of moving to heuristic 
methods of solution for larger object sets, we will still begin with a strategy 
that can fit an optimal ultrametric to P for each of the three $L_p$-norms
introduced in the last section. The approach suggested is based on dynamic
programming and the construction of a recursive system that will eventu- 
ally produce an optimal solution. There are some complications that arise 
in the use of a straightforward dynamic programming formulation because 
of the need to impose an order constraint on the successive between-subset 
aggregate values, and these difficulties will be addressed below in some de- 
tail. In addition, a strategy for heuristically extending the basic dynamic 
programming formulation is developed in the next subsection for dealing 
with large(r) object set sizes. 


3.1 Identifying optimal ultrametrics 


To implement a dynamic programming approach for locating an optimal
ultrametric, we first define a collection of sets, $\Omega_1, \ldots, \Omega_n$,
where $\Omega_k$ contains all partitions of the n objects in S into $n - k + 1$
classes. For convenience, a member of $\Omega_k$ is denoted by $A_k$; thus,
$\Omega_1$ contains the single partition $A_1$ that has n classes in which each
of the n objects forms a separate class, and $\Omega_n$ contains the single
partition $A_n$ that includes one class for all of the n objects in S. We will
say that a transition from $A_{k-1} \in \Omega_{k-1}$ to $A_k \in \Omega_k$ is
permissible if the union of two classes in $A_{k-1}$ produces $A_k$, and
if an admissibility criterion to be discussed shortly is satisfied (that would
[hopefully] ensure that the sequence of between-subset aggregate values is
nondecreasing). A function $F(A_k)$ for $A_k \in \Omega_k$ is defined as the
optimal value for the sum of the contributions for the chosen $L_p$-norm up to
the partition $A_k$. Beginning with $F(A_1) = 0$ for $A_1 \in \Omega_1$, we
construct $F(A_k)$ recursively by

$$F(A_k) = \min \{ F(A_{k-1}) + C(A_{k-1}, A_k) \},$$

where the minimum is taken over all $A_{k-1} \in \Omega_{k-1}$ for which a
transition is permissible to $A_k \in \Omega_k$, and $C(A_{k-1}, A_k)$ is the
incremental cost of transforming $A_{k-1}$ to $A_k$ characterized by the
appropriate $L_p$-norm when that pair of subsets in $A_{k-1}$ is united to form
$A_k$. (It is this latter independence of incremental cost from how $A_{k-1}$
was obtained that is crucial to proving the validity of the recursive process.)
Finally, an optimal solution is identified by $F(A_n)$ for the single entity
$A_n \in \Omega_n$, and a partition hierarchy attaining this optimal value is
identified by working backwards through the recursion starting from $\Omega_n$
and proceeding to $\Omega_1$ and tracing the process of how $F(A_n)$ was
generated.
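
The recursion can be sketched for very small object sets as follows. This is an illustrative sketch only, not the program used by the authors: the function names are hypothetical, no admissibility criterion is imposed (so the order constraint on the aggregate values is not enforced), and every partition is stored explicitly, which is feasible only for a handful of objects.

```python
from itertools import combinations
from statistics import mean, median


def merge_cost(p, class_u, class_v, norm):
    """Incremental cost C(A_{k-1}, A_k) of uniting two classes."""
    between = [p[i][j] for i in class_u for j in class_v]
    if norm == "L1":
        b = median(between)
        return sum(abs(x - b) for x in between)
    if norm == "L2":
        b = mean(between)
        return sum((x - b) ** 2 for x in between)
    b = (min(between) + max(between)) / 2  # L-infinity aggregate
    return max(abs(x - b) for x in between)


def optimal_hierarchy(p, norm="L1"):
    """Return (F(A_n), list of partitions A_1, ..., A_n) by forward recursion."""
    n = len(p)
    start = frozenset(frozenset([i]) for i in range(n))
    level = {start: (0.0, None)}  # partition -> (F value, best predecessor)
    history = [level]
    for _ in range(n - 1):
        next_level = {}
        for part, (f_value, _pred) in level.items():
            for class_u, class_v in combinations(part, 2):
                new_part = (part - {class_u, class_v}) | {class_u | class_v}
                candidate = f_value + merge_cost(p, class_u, class_v, norm)
                if new_part not in next_level or candidate < next_level[new_part][0]:
                    next_level[new_part] = (candidate, part)
        history.append(next_level)
        level = next_level
    # Work backwards from the single all-inclusive partition.
    final = frozenset([frozenset(range(n))])
    path, part = [final], final
    for k in range(n - 1, 0, -1):
        part = history[k][part][1]
        path.append(part)
    path.reverse()
    return history[-1][final][0], path


# An exactly ultrametric 4 x 4 matrix (illustrative data).
p = [[0, 1, 3, 3],
     [1, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
cost, path = optimal_hierarchy(p, "L1")
```

Because the example matrix is already an ultrametric, the optimal loss is zero and the two-class partition on the recovered path is {0, 1}, {2, 3}.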

One unresolved issue needing discussion is the explicit imposition of
some type of admissibility criterion for defining a permissible transition
from $A_{k-1}$ to $A_k$ that could ensure a nondecreasing sequence of
between-subset aggregate values. Unfortunately, the validity of the recursive
process depends on the property that any proposed criterion for admissibility
must involve only $A_{k-1}$ and $A_k$ and their relation to the matrix P, and
specifically must not depend on how $A_{k-1}$ may have been arrived at. Thus,
it is not possible to define admissibility directly by requiring the
between-subset aggregate value that defines $A_k$ from $A_{k-1}$ to be greater
than or equal to the last between-subset aggregate value that led to $A_{k-1}$
from $A_{k-2}$. What can be offered, however, are two (less-than-ideal)
alternatives: (a) an admissibility criterion based only on $A_{k-1}$ and $A_k$
that may sometimes be too lenient and thus fail to ensure that the collection
of between-subset aggregate values is nondecreasing for the (purportedly
optimal) identified ultrametric, or (b) an admissibility criterion based only
on $A_{k-1}$ and $A_k$ that may be too strict, so that the (purportedly optimal)
identified ultrametric could in fact not be the absolute best obtainable.

To be specific, the possibly too lenient criterion rests on the observation
(made originally by Chandon, Lemaire, and Pouget, 1980, for the $L_2$-norm)
that in an optimal ultrametric based on any of the three $L_p$-norms (with
the notion of an aggregate value defined by the median, mean, or the average
of the two extreme proximities), the nondecreasing constraint on the
between-subset aggregate values, $b_1 \le \cdots \le b_{n-1}$, requires that
$b_k$ be both greater than or equal to each such aggregate value calculated
within a subset of $A_{k-1}$, and less than or equal to the aggregate value of
all proximities between the subsets in $A_k$. Since these two conditions may be
evaluated given only $A_{k-1}$ and $A_k$, they can be imposed in defining
whether a transition from $A_{k-1}$ to $A_k$ is permissible. Alternatively, the
possibly too strict admissibility criterion would require that $b_k$ be less
than or equal to any between-subset aggregate value calculated for the new
subset formed in $A_k$ and some other subset present in $A_k$. This latter
criterion would ensure that no (nontrivial) order inversions in the sequence of
between-subset aggregate values would exist (a trivial inversion would be one
in which an inversion may be present in the collection of between-subset
aggregate values, but it can be removed by a simple reordering of when two
disjoint subsets are formed).
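
The distinction between trivial and nontrivial inversions can be made concrete with a short sketch. It rests on one reading of the text, stated here as an assumption: an inversion is nontrivial when a class is formed at an aggregate value smaller than the value at which one of its two constituent classes was formed, since no reordering of merges among disjoint subsets can remove it. The function name and input format are hypothetical.

```python
def nontrivial_inversions(merges):
    """List merges whose aggregate value is below that of a constituent class.

    `merges` is an ordered list of (class_u, class_v, b) triples, where b is
    the between-subset aggregate value at which the two classes are united.
    """
    formed_at = {}  # class (frozenset) -> aggregate value at its formation
    violations = []
    for class_u, class_v, b in merges:
        class_u, class_v = frozenset(class_u), frozenset(class_v)
        for child in (class_u, class_v):
            # Singletons were never "formed", so they cannot be violated.
            if child in formed_at and formed_at[child] > b:
                violations.append((child, formed_at[child], class_u | class_v, b))
        formed_at[class_u | class_v] = b
    return violations


# The sequence 2.0, 1.0, 3.0 decreases once, but only between disjoint
# subsets, so the inversion is trivial and nothing is reported.
trivial = [({2}, {3}, 2.0), ({0}, {1}, 1.0), ({0, 1}, {2, 3}, 3.0)]
# Here {0, 1, 2} is formed at a value below that of its constituent {0, 1}.
nontrivial = [({0}, {1}, 5.0), ({0, 1}, {2}, 4.0)]
report_trivial = nontrivial_inversions(trivial)
report_nontrivial = nontrivial_inversions(nontrivial)
```

Checking each class only against its two immediate constituents is enough, because the comparison chains through every nested level of the hierarchy.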

The computer program relied on for the numerical examples in Section 
4 allows the imposition of either of the two admissibility criteria discussed 
above. As a suggested analysis strategy, one would begin with the former 
(and possibly too lenient) admissibility criterion and if no nontrivial order 
inversions in the between-subset aggregate values are found, an optimal 
ultrametric has been identified. If nontrivial order inversions were present, 
the possibly too strict admissibility criterion could be adopted, and the 
then identified ultrametric presumed optimal (but with the caveat that it 
could be possible in some [rare] instances for an even better ultrametric 
to be generated). (For convenience of reference, the program we use is 
referred to by the acronym HPHI, for ‘Heuristic Programming HIerarchical
clustering’, where the term ‘heuristic’ is included because of the extensions 
it includes for dealing with larger object sets, as discussed in the section to 
follow.) 


3.2 Heuristic extensions for large(r) object sets 


When the number of objects in S is even moderate in size, the random 
access memory storage requirements necessary for a dynamic programming 
approach to constructing an optimal ultrametric can become quite large. 
Necessary for implementing the proposed recursive strategy is the availability
of large arrays associated with the sets $\Omega_1, \ldots, \Omega_n$ that
contain, for all partitions of S, the recursively constructed values $F(A_k)$
for $A_k \in \Omega_k$, as well as a mechanism for keeping track of what
previous partitions in $\Omega_{k-1}$
led to these optimal values $F(A_k)$.¹ For larger object sets, HPHI allows two
options: (a) finding optimal ultrametrics for subsets of S, and (b) finding 
optimal ultrametrics when the basic objects to be hierarchically partitioned 
are themselves subsets of S. By the judicious and repeated use of these two 
options, we have been able to approach object sets with reasonably large 
sizes (and will do so for an object set of size 30 in the next section). 

The analysis strategy we suggest begins by identifying [possibly through 
a heuristic mechanism] a partition of S, say Pe, that is initially forced to 
be induced as part of the best-fitting ultrametric we construct. The classes 
of Pe are first treated as the basic objects on which an ultrametric is to 
be obtained, i.e., we begin with the classes of Pe and complete the iden- 
tification of an optimal ultrametric from this point on. Secondly, each of 
the classes of Pe is then used to obtain a separate optimal ultrametric for 
the objects in that class. When these results are concatenated, an optimal 
ultrametric is identified, subject to the constraint that Pe is part of the par- 
tition hierarchy it induces. Obviously, if Pe is chosen appropriately to begin 
with, the concatenated results would be optimal for the complete object 
set S. A check on the choice of Pe (however it was obtained initially) can 
be carried out by using object classes identified within the subsets defining 
Pe as the basic units on which an optimal ultrametric is to be constructed 
and then completing the fitting from this point on. If Pe is retrieved as 
part of this latter process, some obviously increased confidence is obtained 
that the concatenated ultrametric may be the best we can find. If, on the 
other hand, Pe is not retrieved, we could then repeat this same strategy 
with whatever partition was observed (presumably for the same number of 
classes as contained in Pe). This whole process could be carried out iter- 
atively until convergence. Obviously, an absolute guarantee of optimality 
is not possible through this type of heuristic search, but the eventual sta- 
bility achieved leads to an ultrametric that is usually very good (although 
not verifiably optimal). Throughout this discussion it is assumed that the 
subsets of objects for which separate optimal ultrametrics are generated, or 
the number of object classes to be used in obtaining an optimal ultrametric 
beginning from that point, are all of a size that could be handled optimally 
(i.e., some number in the lower teens).


¹ Given the usual Pentium-level processors now commonly available and the amount
of memory these systems typically contain, the program we have developed can deal
(optimally) with object set sizes in the lower teens, but even this requires the
capability of Fortran 90 to allocate very large arrays dynamically (and inform the
user whether sufficient memory exists on the system to solve the problem of the
size being requested).


4 A numerical illustration


To illustrate the construction of best-fitting ultrametrics based on the Lp- 
norm for a given proximity matrix, we use a data set originally collected 
by Arabie and Rips (1973) for a replication of a study initially conducted 
by Henley (1969) involving the subjectively-judged similarity of 30 ani- 
mals. Fifty-three subjects assessed the similarity between all 435 animal 
pairs based on a scale from 1 (extremely dissimilar) to 10 (extremely simi- 
lar). Table 1 provides the animal names and the summed ratings over the 
subjects subtracted from the maximum of 530 so the proximities would be 
keyed as dissimilarities. (We provide these data in Table 1 as a convenience 
to others who may wish to use this proximity matrix in their own method- 
ological examples. Although these data have been analyzed elsewhere (see 
e.g., De Soete and Carroll, 1996), they have not been published explicitly.) 

Based on the data of Table 1, the results are presented below for each 
of the three $L_p$-norms using the heuristic process of Section 3.2 for finding
best-fitting ultrametrics. Specifically, a five-class partition, Pe, of the
object set S was first identified heuristically (the greedy complete-link
hierarchical clustering method was used up to the level of five classes). An
(optimal) ultrametric was then found for each of the five classes within Pe, 
and based on these separate ultrametrics, a collection of (smaller) object 
subsets identified and treated as the starting point from which to finish 
the identification of an ultrametric for the complete object set S. Based on 
this latter ultrametric, the object classes for the induced five-class partition 
were then considered as defining an initial partition, Pe, and the whole pro- 
cedure repeated. For all three Lp-norms, the latter five-class partitions were 
retrieved immediately. In all of these analyses, and as suggested in the last 
section, the admissibility criterion that may at times be too lenient (to en- 
sure a strictly nondecreasing between-subset collection of aggregate values) 
was first used, and when nontrivial order inversions were observed (as they 
were for a few of the analyses carried out), the more strict admissibility 
condition was then adopted. 

The results for both the $L_1$- and $L_2$-norm are very similar, and the same
five-class partition was induced for the corresponding ultrametrics: 


A: {bear (2), cat (5), dog (10), fox (13), leopard (18), lion (19), tiger 
(28), wolf (29)} — carnivorous feline/canine animals plus the omnivorous 
bear 

B: {beaver (3), chipmunk (7), mouse (21), rabbit (23), raccoon (24), rat 
(25), squirrel (27)} — small rodent-like animals 

C: {antelope (1), camel (4), cow (8), deer (9), donkey (11), elephant 
(12), giraffe (14), goat (15), horse (17), sheep (26), zebra (30)} — large 
hoofed herbivores (ungulates)
D: {chimpanzee (6), gorilla (16), monkey (20)} — primates 
E: {pig (22)} — Suidae 


The (optimal) ultrametrics defined by the values of the between-subset
aggregate values for the $L_1$- and $L_2$-norm constructed within each of the
classes labeled above as A, B, C, and D are given below (we also present
those for the $L_\infty$-norm in the case of the two classes labeled B and D
that were also observed in the retrieved ultrametric using this latter norm).
Within each class we also provide a summary measure of the discrepancy
between the proximities and fitted values by giving the contribution each
class has to the overall $L_p$-norm measure being minimized.


level   new class formed                            $L_1$     $L_2$     $L_\infty$

A:8     {2, 5, 10, 13, 18, 19, 28, 29}              [?]       278.0
        an addition of omnivorous bear to the
        carnivorous feline plus canine classes
A:7     {5, 10, 13, 18, 19, 28, 29}                 217.0     222.6
        the union of the carnivorous feline and
        canine classes
A:6     {10, 13, 29}                                [?]       [?]
        canines
A:5     {5, 18, 19, 28}                             [?]       [?]
        felines
A:4     {13, 29}                                    [?]       [?]
        nondomestic canines
A:3     {18, 19, 28}                                [?]       [?]
        nondomestic felines
A:2     {18, 28}                                    [?]       [?]
        feline (subclass)
A:1     (all separate)                              -         -
contribution to the norm measures:                  [?]       [?]

B:7     {3, 7, 21, 23, 24, 25, 27}                  [?]       [?]       206.5
B:6     {3, 7, 23, 24, 27}                          [?]
B:6     {3, 23, 24}                                           [?]       200.5
        somewhat larger animals
B:5     {3, 7, 24, 27}                              173.5
B:5     {7, 21, 25, 27}                                       [?]       164.5
        very small animals
B:4     {3, 24}                                     123.0     123.0     123.0
B:3     {7, 27}                                     22.0      22.0      22.0
        long-bushy-tail animals
B:2     {21, 25}                                    6.0       6.0       6.0
        long-naked-tail animals
B:1     (all separate)                              -         -         -
contribution to the norm measures:                  458.0     18,500.   79.5

C:11    {1, 4, 8, 9, 11, 12, 14, 15, 17, 26, 30}    286.0     298.2
        the final addition of elephant
C:10    {1, 4, 8, 9, 11, 14, 15, 17, 26, 30}        238.0     242.7
C:9     {1, 4, 9, 11, 14, 17, 30}                   212.5     215.9
C:8     {8, 15, 26}                                 200.5     200.5
        farm animals
C:7     {4, 14}                                     174.0     174.0
        African animals
C:6     {1, 9, 11, 17, 30}                          167.5     174.7
        horse-like animals
C:5     {15, 26}                                    93.0      93.0
        farm animals (subclass)
C:4     {11, 17, 30}                                81.5      81.5
        equine
C:3     {1, 9}                                      49.0      49.0
        deer-like animals
C:2     {11, 17}                                    31.0      31.0
        domestic equine
C:1     (all separate)                              -         -
contribution to the norm measures:                  149.6     298,200.

D:3     {6, 16, 20}                                 49.5      49.5      49.5
D:2     {6, 20}                                     26.0      26.0      26.0
D:1     (all separate)                              -         -         -
contribution to the norm measures:                  19.0      200.0     9.5


Based on the five classes, A, B, C, D, and E, the completions of a
best-fitting ultrametric beginning from this point are given below for the
$L_1$- and $L_2$-norm:


level   new class formed        $L_1$       $L_2$

5       {A, B, C, D, E}         389.0       382.1
4       {A, C, D}               375.0
4       {A, C, D, B}                        373.1
3       {E, B}                  354.0
3       {A, C, E}                           307.0
2       {A, C}                  323.0       323.2
1       (all separate)          -           -
contribution to the norm measures:  10,494.     598,600.


For the $L_\infty$-norm, the five-class partition retrieved for the
corresponding ultrametric differed slightly from that for the $L_1$- and
$L_2$-norm and involved the placement of bear (2), elephant (12), and pig (22).
Explicitly, the two classes previously labeled as B and D were again retrieved
for the $L_\infty$-norm, but the three other classes varied slightly:


F: {antelope (1), camel (4), cow (8), deer (9), donkey (11), giraffe (14), 
goat (15), horse (17), pig (22), sheep (26), zebra (30)} — large hoofed 
herbivores including (appropriately) pig and excluding elephant

G: {cat (5), dog (10), fox (13), leopard (18), lion (19), tiger (28), wolf 
(29)} — felines/canines only (excluding bear) 

H: {bear (2), elephant (12)} — large animals 

B: {beaver (3), chipmunk (7), mouse (21), rabbit (23), raccoon (24), rat 
(25), squirrel (27)} — small rodent-like animals 

D: {chimpanzee (6), gorilla (16), monkey (20)} — primates 


Using these latter five classes, the completion of a best-fitting ultrametric
is given below for the $L_\infty$-norm; subsequently, the optimal ultrametrics
within the classes labeled F, G, and H are given (those for the two classes
B and D were provided previously along with the $L_1$- and $L_2$-norm results):


level   new class formed                            $L_\infty$

5       {B, D, F, G, H}                             343.5
4       {D, F, G, H}                                335.0
3       {F, G, H}                                   331.0
2       {F, H}                                      313.5
1       (all separate)                              -
contribution to the norm measure:                   362.0

F:11    {1, 4, 8, 9, 11, 14, 15, 17, 22, 26, 30}    294.5
F:10    {8, 15, 22, 26}                             272.5
F:9     {1, 4, 9, 11, 14, 17, 30}                   216.0
F:8     {8, 15, 26}                                 200.5
F:7     {1, 9, 11, 17, 30}                          178.5
F:6     {4, 14}                                     174.0
F:5     {15, 26}                                    93.0
F:4     {11, 17, 30}                                81.5
F:3     {1, 9}                                      49.0
F:2     {11, 17}                                    31.0
F:1     (all separate)                              -
contribution to the norm measure:                   292.5

G:7     {5, 10, 13, 18, 19, 28, 29}                 232.0
G:6     {10, 13, 29}                                104.5
G:5     {5, 18, 19, 28}                             63.0
G:4     {13, 29}                                    57.0
G:3     {18, 19, 28}                                35.0
G:2     {18, 28}                                    24.0
G:1     (all separate)                              -
contribution to the norm measure:                   91.0

H:2     {2, 12}                                     305.0
H:1     (all separate)                              -
contribution to the norm measure:                   0.0


5 Constructing (optimal) ultrametrics for 
two-mode proximity data 


The discussion of finding optimal ultrametrics has been restricted thus far
to a single object set S for which a symmetric $n \times n$ dissimilarity
matrix P is available. A direct extension is possible, however, to the context
of a (two-mode) $n_A \times n_B$ dissimilarity matrix $Q = \{q_{ij}\}$ defined
between the objects from two distinct sets, say
$S_A = \{O_{r_1}, \ldots, O_{r_{n_A}}\}$ and
$S_B = \{O_{c_1}, \ldots, O_{c_{n_B}}\}$, containing $n_A$ and $n_B$ objects
respectively, and where $q_{ij}$ denotes a dissimilarity between the (row)
object $O_{r_i}$ and the (column) object $O_{c_j}$. Specifically, a combined
single object set S is first constructed as $S = S_A \cup S_B$ containing
$n = n_A + n_B$ objects, and the same dynamic programming strategy for locating
an (optimal) ultrametric is now applied to the single set S but with two
modifications: (i) when considering the recursive process over the sets
$\Omega_1, \ldots, \Omega_n$, a transition from $A_{k-1} \in \Omega_{k-1}$
to $A_k \in \Omega_k$ is not permissible whenever the new subset formed in
$A_k$ would contain only objects from $S_A$ or from $S_B$; (ii) in generating
the between-subset aggregate values and the contribution to the chosen norm
measure for a transition from $A_{k-1}$ to $A_k$, only those proximities
defined between the object sets $S_A$ and $S_B$ are considered. Based on this
strategy, the between-subset aggregate values producing the fitted values for
the proximities in Q, denoted generically as $T = \{t_{ij}\}$, will satisfy the
two-set ultrametric inequality (e.g., see Furnas, 1980; De Soete, DeSarbo,
Furnas, and Carroll, 1984a, 1984b): for $O_{r_i}, O_{r_{i'}} \in S_A$ and
$O_{c_j}, O_{c_{j'}} \in S_B$, the largest two values among $t_{r_i c_j}$,
$t_{r_i c_{j'}}$, $t_{r_{i'} c_j}$ and $t_{r_{i'} c_{j'}}$ are equal.
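
The two-set ultrametric inequality stated above can be checked directly on a fitted two-mode matrix. The following sketch assumes the fitted values are held in a rectangular list of lists (rows for $S_A$, columns for $S_B$); the function name and the tolerance argument are hypothetical.

```python
from itertools import combinations


def satisfies_two_set_ultrametric(t, tol=1e-9):
    """Check that, for every pair of rows and pair of columns, the two
    largest of the four corresponding entries are equal (within tol)."""
    n_rows, n_cols = len(t), len(t[0])
    for i, i2 in combinations(range(n_rows), 2):
        for j, j2 in combinations(range(n_cols), 2):
            values = sorted([t[i][j], t[i][j2], t[i2][j], t[i2][j2]])
            if abs(values[3] - values[2]) > tol:
                return False
    return True


# Illustrative fitted matrices (not taken from the data of this paper).
t_ok = [[1, 3],
        [3, 3]]
t_bad = [[1, 2],
         [3, 4]]
ok_result = satisfies_two_set_ultrametric(t_ok)
bad_result = satisfies_two_set_ultrametric(t_bad)
```

The check is quadratic in both $n_A$ and $n_B$, which is unproblematic for the object set sizes that the dynamic programming strategy can handle.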

animal             1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
name              16  17  18  19  20  21  22  23  24  25  26  27  28  29  30

antelope (1)       *   *   *   *   *   *   *   *   *   *   *   *   *   *   *
bear (2)         396   *   *   *   *   *   *   *   *   *   *   *   *   *   *
beaver (3)       378 333   *   *   *   *   *   *   *   *   *   *   *   *   *
camel (4)        225 334 410   *   *   *   *   *   *   *   *   *   *   *   *
cat (5)          340 341 266 384   *   *   *   *   *   *   *   *   *   *   *
chimpanzee (6)   380 366 354 403 355   *   *   *   *   *   *   *   *   *   *
chipmunk (7)     387 389 176 434 273 362   *   *   *   *   *   *   *   *   *
cow (8)          240 290 395 260 376 406 403   *   *   *   *   *   *   *   *
deer (9)          49 315 359 249 339 392 357 238   *   *   *   *   *   *   *
dog (10)         316 282 305 329 221 337 327 304 280   *   *   *   *   *   *
donkey (11)      206 341 385 199 365 377 404 216 228 316   *   *   *   *   *
elephant (12)    301 305 416 256 420 383 469 258 342 382 285   *   *   *   *
fox (13)         315 289 286 377 224 360 325 357 289 126 347 407   *   *   *
giraffe (14)     203 356 418 174 422 372 429 313 236 370 290 266 380   *   *
goat (15)        179 338 361 257 327 381 371 193 196 240 204 363 313 310   *
gorilla (16)     379 251 400 378 411  40 412 397 407 369 382 347 362 347 412
horse (17)       157 304 389 142 371 391 423 181 150 267  31 274 320 190 209
                 374   *   *   *   *   *   *   *   *   *   *   *   *   *   *
leopard (18)     268 288 390 341  52 349 419 355 281 262 326 339 213 323 332
                 335 285   *   *   *   *   *   *   *   *   *   *   *   *   *
lion (19)        276 257 381 327  75 365 [?] 322 [?] [?] 323 [?] 213 [?] 325
                 323 247  38   *   *   *   *   *   *   *   *   *   *   *   *
monkey (20)      383 358 352 403 359  26 310 419 [?] [?] [?] [?] [?] 388 [?]
                  59 395 363 369   *   *   *   *   *   *   *   *   *   *   *
mouse (21)       434 436 261 439 336 388 164 416 386 349 412 473 372 453 386
                 472 440 430 457 385   *   *   *   *   *   *   *   *   *   *
pig (22)         410 356 350 395 359 408 394 269 368 299 347 371 362 410 284
                 404 344 400 389 401 375   *   *   *   *   *   *   *   *   *
rabbit (23)      321 394 207 407 271 378 201 400 323 297 390 435 301 420 340
                 430 394 391 403 360 222 343   *   *   *   *   *   *   *   *
raccoon (24)     356 304 123 391 229 347 171 405 349 282 383 433 214 398 [?]
                 397 383 356 371 307 248 344 194   *   *   *   *   *   *   *
rat (25)         422 406 245 440 295 401 155 431 405 353 421 465 358 448 382
                 452 431 427 429 398   6 354 239 270   *   *   *   *   *   *
sheep (26)       233 335 355 296 314 390 384 208 230 263 239 350 333 341  93
                 408 247 356 337 394 397 261 317 335 395   *   *   *   *   *
squirrel (27)    368 378 183 422 264 347  22 413 366 322 401 454 312 439 389
                 438 413 410 409 313 161 385 188 143 174 364   *   *   *   *
tiger (28)       281 243 403 328  57 368 431 355 295 287 320 318 205 333 316
                 320 297  24  32 354 445 415 412 371 430 348 415   *   *   *
wolf (29)        301 246 338 349 245 366 397 345 312  83 317 374  57 362 303
                 339 279 180 177 382 417 383 367 292 374 310 377 181   *   *
zebra (30)       129 319 396 214 347 375 416 228 178 293 116 287 307 211 222
                 367  47 244 252 378 437 370 384 377 431 258 413 240 290   *

Table 1: A lower-triangular dissimilarity matrix between thirty animals 
based on data collected by Arabie and Rips (1973). 


Abstract: Clustering is usually considered an art rather than a science
because the discipline lacks comprehensive mathematical theories.
The major issue raised in this paper is that $L_2$ and $L_1$ approximation
bilinear clustering can provide a theoretical framework for an extensive part
of partitioning and hierarchic clustering concerning its algorithmic and
interpretational aspects, which is supported with theoretical evidence.


1 Introduction


Clustering is usually considered an art rather than a science because the
discipline lacks comprehensive mathematical theories. The major issue raised
in this paper is that approximation bilinear clustering can provide a
theoretical framework for a part of partitioning and hierarchic clustering
concerning its algorithmic and interpretational aspects. Two approximation
norms, $L_1$ and $L_2$, are considered and compared.

The remainder consists of two parts devoted respectively to partitioning
(Sections 2 and 3) and hierarchic clustering (Section 4), and a conclusion
(Section 5). In Section 2, a bilinear model relating data to a partition is
considered. The model is introduced in Section 2.1 where two model-based
principles for data standardization are suggested. In Section 2.2, an $L_2$
decomposition of the data scatter into explained and unexplained parts is
discussed, especially in its relation to the nominal data case. It appears
that some of the best-known contingency measures, such as the Pearson
chi-square, can be interpreted as contributions to the data scatter. Similar
work is done for $L_1$ in Section 2.3.

In Section 3, clustering algorithms are discussed for both the $L_1$ and $L_2$
criteria. In Section 3.1, K-Means and principal cluster analysis methods
are considered as locally optimal approximation techniques (the latter can
also be applied for finding overlapping clusters). In Section 3.2, a
prerequisite for this is outlined: the interrelation between six otherwise
independent parameters of cluster structure, emerging in the context of the
bilinear model.

In Section 4, hierarchic clustering is put in the bilinear modeling framework.
In Section 4.1, 3-valued nest indicator functions are introduced to
provide for exact embedding of binary hierarchies into linear subspaces.
The case of $L_2$ hierarchic clustering is considered in Section 4.2, which
is proved similar to the case of $L_2$ partitioning except that here the
decomposition concerns not only the data scatter but also the data entries
and between-variable correlations. The case of $L_1$ hierarchic clustering is
treated in Section 4.3. Due to the fact that the split cluster centers must
be interrelated here, the alternating minimization technique produces a
modified clustering approach.


2 Bilinear partition model and scatter 
decomposition 


2.1 Bilinear model and standardization of mixed data 


Let us consider an entity-to-variable data table in which a quantitative
variable k is represented by a quantitative N-dimensional column-vector $x_k$
of its values on the N entities under consideration. A binary variable
(category) k is basically a question admitting only answers Yes or No for each
of the entities; the values are coded 1 (Yes) and 0 (No), which produces a
zero-one N-dimensional column vector $x_k$. A nominal variable k is coded by a
zero-one $N \times \#k$ submatrix $x_k = (x_{iv})$ where $\#k$ is the number of
categories $v \in k$ and $x_{iv}$ equals 1 when entity i belongs to category v
of k, and 0 otherwise.

Encoded this way, the data matrix will be denoted by $X = (x_{iv})$ where
$i \in I$ are entities and $v \in V$ are variables/categories corresponding to
columns. These data are preprocessed into matrix $Y = (y_{iv})$ by the standard
preliminary transformation (standardization) so that

$$y_{iv} = \frac{x_{iv} - a_v}{b_v}, \quad i \in I,\ v \in V \qquad (1)$$

assuming change of both the scale factor (dividing by $b_v$) and the origin
(shifting by $a_v$) in the original column $x_v$. Choice of $a_v$ and $b_v$ as
well as their meaning for categories will be discussed below after introducing
the bilinear clustering model.
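
Transformation (1) can be sketched as below. The choice of $a_v$ and $b_v$ is a matter the paper takes up later; in this example the column mean and the column range are used purely as placeholder assumptions, and the function name `standardize` is hypothetical.

```python
def standardize(x, a, b):
    """Apply y_iv = (x_iv - a_v) / b_v to every entry of the data matrix x."""
    return [[(row[v] - a[v]) / b[v] for v in range(len(a))] for row in x]


# Illustrative data: three entities, two quantitative variables.
X = [[1.0, 10.0],
     [3.0, 30.0],
     [5.0, 20.0]]
n_vars = len(X[0])
# Assumed (placeholder) choices: shift by the column mean, scale by the range.
a = [sum(row[v] for row in X) / len(X) for v in range(n_vars)]
b = [max(row[v] for row in X) - min(row[v] for row in X) for v in range(n_vars)]
Y = standardize(X, a, b)
```

Any other pair of shift and scale vectors can be passed in the same way, so the sketch does not commit to a particular standardization principle.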

Let the entities be assigned into groups (clusters) presented by an
additive type cluster structure which is a set of m clusters, any cluster t,
$t = 1, \ldots, m$, being defined with two objects: 1) its membership
function $z_t = (z_{it})$, $i \in I$, where $z_{it}$ is 0 or 1, characterizing
thus a cluster set $S_t = \{i \in I : z_{it} = 1\}$, 2) its standard point, or
centroid vector, $c_t = (c_{tv})$, $v \in V$, to be combined in an
$N \times |V|$ cluster-type matrix with elements $\sum_t c_{tv} z_{it}$.
($|V|$ is the number of columns in X.)

The cluster-type matrix models the given matrix Y via equations

$$y_{iv} = \sum_{t=1}^{m} c_{tv} z_{it} + e_{iv} \qquad (2)$$

where residual values $e_{iv}$ show the difference between the data and the
clusters. When clusters are not given a priori, they can be found in such a way
that the residuals are made as small as possible, thus minimizing
$\Phi(\{|e_{iv}|\})$ where $\Phi$ is an increasing monotone function of its
arguments. The equations in (2) along with criterion $\Phi$ to be minimized by
unknown parameters, $c_{tv}$, $z_{it}$, $e_{iv}$, for $y_{iv}$ given, will be
referred to as the bilinear clustering model.
This model was suggested by the author as an extension of a version of the
principal component analysis technique in Mirkin (1987) and updated in
Mirkin (1990). It was considered also in Chaturvedi and Carroll (1994). A
detailed account of the model and its use in hard and fuzzy clustering and
machine learning can be found in Mirkin (1996).

Though the model is quite similar to that of the principal component 
analysis (the only difference is that the “components” z are Boolean, not 
arbitrary, vectors), it has a meaning on its own, just as a clustering model. 
When the clusters are required to be nonoverlapping, the type-cluster ma- 
trix $i] CtyZit has especially simple structure considered also by Van Bu- 
uren and Heiser (1989): its rows are the vectors c = (Cj) so that every 
i-th row equals c: for that specific cluster t which contains the entity 2 € I. 

Two Minkowski forms of criterion $\Phi$ for minimizing the residuals are
$L_2 = \sum_{i \in I}\sum_{v \in V} e_{iv}^2$ and
$L_1 = \sum_{i \in I}\sum_{v \in V} |e_{iv}|$. With the non-overlapping
restriction, the criteria become especially simple:


$$L_p = \sum_{i \in I}\sum_{v \in V} \Big|y_{iv} - \sum_{t} c_{tv} z_{it}\Big|^p = \sum_{v \in V}\sum_{t=1}^{m}\sum_{i \in S_t} |y_{iv} - c_{tv}|^p \qquad (3)$$


which shows that $L_p$, actually, is $L_p = \sum_{t=1}^{m}\sum_{i \in S_t} d^p(y_i, c_t)$
where $d^p$ is the p-th power of the Minkowski distance.

Due to formula (3), when the membership functions are given, the
optimal $c_{tv}$ is determined only by the values $y_{iv}$ within $S_t$. In
particular, the least-squares (p = 2) optimal $c_{tv}$ is the average of
$y_{iv}$ in $S_t$, $c_{tv} = \sum_{i \in S_t} y_{iv}/|S_t|$, while the
least-moduli (p = 1) optimal $c_{tv}$ is a median of $y_{iv}$, $i \in S_t$.
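
The role of the mean and the median as optimal standard points can be checked numerically. The following is a minimal sketch under stated assumptions: the toy data, the fixed two-cluster partition and the function names `lp_criterion` and `optimal_centers` are hypothetical illustrations, not part of the original text.

```python
import numpy as np

def lp_criterion(Y, labels, centers, p):
    """Formula (3): sum over clusters t and entities i in S_t of |y_iv - c_tv|^p."""
    return sum(
        (np.abs(Y[labels == t] - c) ** p).sum()
        for t, c in enumerate(centers)
    )

def optimal_centers(Y, labels, m, p):
    """Within-cluster column means (p = 2) or column medians (p = 1)."""
    stat = np.mean if p == 2 else np.median
    return np.array([stat(Y[labels == t], axis=0) for t in range(m)])

# Hypothetical toy data: 12 entities, 3 variables, a fixed 2-cluster partition.
rng = np.random.default_rng(0)
Y = rng.normal(size=(12, 3))
labels = np.array([0] * 5 + [1] * 7)

for p in (1, 2):
    best = optimal_centers(Y, labels, 2, p)
    print("p =", p, "criterion at optimal centers:", lp_criterion(Y, labels, best, p))
```

Moving the centers away from these values should never decrease the criterion, which is what the partition-fixed half of the alternating minimization relies on.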

The criterion $\Phi$, when its argument is the data matrix $Y = (y_{iv})$ itself,
$\Phi(\{|y_{iv}|\})$, may be considered as a measure of the scatter of the data
while $\Phi(\{|e_{iv}|\})$ as a measure of the “unexplained” scatter. Indeed,
their difference, $\Psi = \Phi(\{|y_{iv}|\}) - \Phi(\{|e_{iv}|\})$, will be
nonnegative for any appropriate minimizer of $\Phi$, since $e_{iv} = y_{iv}$ (for
all i, v), and thus $\Psi = 0$, when all $c_{tv} = 0$, which is generally not an
optimal solution. Value $\Psi$ can be interpreted as the “explained” part of the
data scatter $\Phi(\{|y_{iv}|\})$, which gives a decomposition of the data
scatter in the two parts, $\Phi(\{|y_{iv}|\}) = \Psi + \Phi(\{|e_{iv}|\})$.

In this setting, it is the data scatter which is decomposed into explained 
and unexplained parts due to the bilinear model; moreover, the unexplained 
part is nothing but the minimized criterion of the model. This is why the 
present author considers the data scatter as the base for choosing the data 
standardization parameters in (1). 

Let us require that all the variables are standardized so that their contri- 
butions to the data scatter are equal to each other. The principle should be 
considered as an adequate formalization of the requirement of equal weight 
of the variables in numerical taxonomy (Sneath and Sokal, 1973). The
choice of parameter $a_v$ does not affect the model (2) for a non-overlapping
cluster structure; however, when the bilinear model is set forth in a
sequential way with the “component” axes $z_t$ identified one-by-one, not
simultaneously, the solution heavily depends on the origin of the variable/category
space. To adjust to this kind of principal/correspondence analysis meth- 
ods, let us postulate an analogue to the law of minimum moment of inertia 
in mechanics: the origin of the variable space should be a minimizer of the 
data scatter. 

The two scatter-based principles define the parameters unambiguously
for $L_1$ and $L_2$. When p = 2, they lead to the usual z-score
standardization rule: the origin is the grand mean while the standard deviation
is the scale factor, which will be referred to as square-scatter
standardization. When p = 1, the origin must be the grand median while the scale
factor is the absolute deviation, which will be referred to as module-scatter
standardization.
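
The two rules can be sketched as follows. This is an illustration only: it assumes that formula (1) has the form $y_{iv} = (x_{iv} - a_v)/b_v$ and that “absolute deviation” means the mean absolute deviation from the median; the function name `scatter_standardize` and the toy data are hypothetical.

```python
import numpy as np

def scatter_standardize(X, p):
    """Return Y with y_iv = (x_iv - a_v) / b_v.

    p = 2: a_v is the column mean, b_v the standard deviation.
    p = 1: a_v is the column median, b_v the mean absolute deviation
           from the median (an assumed reading of "absolute deviation").
    """
    X = np.asarray(X, dtype=float)
    if p == 2:
        a = X.mean(axis=0)
        b = X.std(axis=0)
    elif p == 1:
        a = np.median(X, axis=0)
        b = np.abs(X - a).mean(axis=0)
    else:
        raise ValueError("p must be 1 or 2")
    return (X - a) / b

# Hypothetical raw data with very different column scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -3.0, 0.5], scale=[1.0, 20.0, 0.1], size=(40, 3))
Y2 = scatter_standardize(X, 2)
Y1 = scatter_standardize(X, 1)

# Per-variable contributions to the data scatter after standardization.
print((Y2 ** 2).sum(axis=0))
print(np.abs(Y1).sum(axis=0))
```

Under these assumptions every column contributes the same amount (N) to the respective data scatter, which is the equal-contribution principle stated above.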

In the case of mixed data, the average of a category $v \in V$ column vector
is equal, obviously, to the relative frequency of the category in I, $p_v$,
while its median may be 1, 1/2, or 0 depending on whether $p_v$ is larger than,
equal to, or smaller than 1/2, respectively. To satisfy the principle of equal
contribution with $a_v = p_v$, the $L_2$-based scale factor of a category v can
be taken as $b_v = \sqrt{1 - \sum_{v \in k} p_v^2}$ where summation is made over
all the categories of a variable k, $v \in k$ (the square root of the Gini
index). Other standardizing options can also be suggested, for instance,
$b_v = \sqrt{(\#k - 1)p_v}$, which is category-specific.

The absolute deviation of the values of a binary column vector from the
median is equal to $p_v$ or $1 - p_v$ depending on whether $p_v$ is less than
1/2 or not.


2.2 Decomposition of the least-squares criterion 


With the least-squares criterion, the following decomposition holds (see, for 
instance, Jain and Dubes, 1988). 


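Statement 1 If values $c_{tv}$ are optimal for a partition $S = \{S_t\}$ of I, then


$$\sum_{i \in I}\sum_{v \in V} y_{iv}^2 = \sum_{t=1}^{m}\sum_{v \in V} c_{tv}^2 |S_t| + \sum_{i \in I}\sum_{v \in V} e_{iv}^2 \qquad (4)$$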


Usually the equation in (4) is interpreted in terms of analysis of vari- 
ance. In cluster analysis, interpretation of (4) in terms of the contributions 
to data scatter seems more helpful. The contribution of a variable-cluster
pair (v,t) to the explained part of the data scatter is $c_{tv}^2|S_t|$: it is
proportional to the cluster cardinality and to the squared distance from the
grand mean of the variable to its mean (standard value) within the cluster.
The contribution of an entity-cluster pair can be evaluated as the inner
product $(y_i, c_t)$ because
$c_{tv}^2|S_t| = (\sum_{i \in S_t} y_{iv}/|S_t|)\, c_{tv} |S_t| = \sum_{i \in S_t} y_{iv} c_{tv}$.
These cluster-specific
salience weights of the variables and entities can be employed for concept 
learning and feature selection in machine learning (Mirkin, 1997). 
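
Decomposition (4) and the two kinds of contributions can be verified on a small example. The sketch below uses hypothetical random data and a hypothetical three-cluster partition; only NumPy is assumed.

```python
import numpy as np

# Hypothetical data: 10 entities, 4 variables, centred at the grand mean.
rng = np.random.default_rng(1)
Y = rng.normal(size=(10, 4))
Y = Y - Y.mean(axis=0)
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
m = 3

sizes = np.bincount(labels, minlength=m)                      # |S_t|
C = np.array([Y[labels == t].mean(axis=0) for t in range(m)]) # optimal c_tv
E = Y - C[labels]                                             # residuals e_iv

scatter = (Y ** 2).sum()
explained_vt = (C ** 2) * sizes[:, None]      # c_tv^2 |S_t| for each pair (t, v)
unexplained = (E ** 2).sum()
entity_contrib = (Y * C[labels]).sum(axis=1)  # inner products (y_i, c_t)

print(scatter, explained_vt.sum() + unexplained)
```

The entity contributions sum to the same explained part as the variable-cluster contributions, so either view can be used for ranking variables or entities within a cluster.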

To analyze the contributions of nominal variables and their categories
to the scatter part explained via cluster partition S, let us denote the
relative frequency (proportion of ones) of category v in set I by $p_v$ and
the proportion of entities simultaneously having category v and belonging
to cluster $S_t$ by $p_{vt}$. Then, for any category v standardized by formula
(1), its mean within cluster $S_t$ is equal to
$c_{tv} = (p_{vt} - p_t a_v)/(p_t b_v)$. The contribution of a category-cluster
pair (v,t) to the explained part of the data scatter is equal to


$$s(v,t) = c_{tv}^2 |S_t| = N (p_{vt} - p_t a_v)^2/(p_t b_v^2), \qquad (5)$$


which can be considered a measure of association between category v and
cluster t.

Since every nominal variable k is considered as the set of its categories
v, the joint contribution of k and the set of the clusters $S_t$ to the scatter
of the data is equal to $F(k, S) = \sum_{t}\sum_{v \in k} s(v,t)$ which is


$$F(k, S) = N \sum_{t=1}^{m}\sum_{v \in k} \frac{(p_{vt} - p_t a_v)^2}{p_t b_v^2} \qquad (6)$$


by (5). Substituting the appropriate values of $a_v = p_v$ and $b_v$, we arrive
at the following.


Statement 2 For criterion $L_2$, the contribution of a nominal variable
$k \in K$ to the part of the square scatter of the square standardized data
that is explained by the (sought or found or expert-given) cluster partition
$S = \{S_1, ..., S_m\}$, is equal to


$$A(S/k) = N \sum_{v \in k}\sum_{t=1}^{m} \frac{(p_{vt} - p_v p_t)^2}{p_t} \qquad (7)$$


when $b_v = 1$ (no normalization), or


$$W(S/k) = N \sum_{v \in k}\sum_{t=1}^{m} \frac{(p_{vt} - p_v p_t)^2}{p_t} \Big/ \Big(1 - \sum_{v \in k} p_v^2\Big) \qquad (8)$$


when $b_v = \sqrt{1 - \sum_{v \in k} p_v^2}$ (a standardizing option suggested), or


$$M(S/k) = \frac{N}{\#k - 1} \sum_{v \in k}\sum_{t=1}^{m} \frac{(p_{vt} - p_v p_t)^2}{p_v p_t} \qquad (9)$$


when $b_v = \sqrt{p_v(\#k - 1)}$ (another standardizing option).


All three of the coefficients relate to well known indices of contingency
between the nominal variables: M(S/k) is a normalized version of the
Pearson chi-square coefficient, A(S/k) is proportional to the coefficient
of reduction of the error of proportional prediction, and W(S/k) is the
Wallis coefficient. Remarkably, it is the method of data standardization
which determines which of the coefficients is produced as the
contribution-to-scatter.
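
The link between the standardization option and the resulting contingency coefficient can be checked directly. The sketch below is illustrative: the function names, the `b_rule` labels and the toy variable are hypothetical; formulas (7) to (9) are computed from the proportions and compared with the explained scatter $\sum_{t,v} c_{tv}^2|S_t|$ obtained from dummy-coded, standardized columns with $a_v = p_v$.

```python
import numpy as np

def nominal_contributions(cat, labels):
    """Coefficients (7), (8), (9) computed from the proportions p_vt, p_v, p_t."""
    N = len(cat)
    cats, clus = np.unique(cat), np.unique(labels)
    P = np.array([[np.mean((cat == v) & (labels == t)) for t in clus] for v in cats])
    pv = P.sum(axis=1, keepdims=True)   # category frequencies p_v
    pt = P.sum(axis=0, keepdims=True)   # cluster proportions p_t
    D2 = (P - pv * pt) ** 2
    A = N * (D2 / pt).sum()
    W = A / (1.0 - (pv ** 2).sum())
    M = N / (len(cats) - 1) * (D2 / (pv * pt)).sum()
    return A, W, M

def explained_scatter(cat, labels, b_rule):
    """Direct route: dummy-code, standardize with a_v = p_v, sum c_tv^2 |S_t|."""
    cats = np.unique(cat)
    X = (cat[:, None] == cats[None, :]).astype(float)
    pv = X.mean(axis=0)
    if b_rule == "none":
        b = np.ones_like(pv)
    elif b_rule == "gini":
        b = np.full_like(pv, np.sqrt(1.0 - (pv ** 2).sum()))
    else:  # "category": b_v = sqrt(p_v (#k - 1))
        b = np.sqrt(pv * (len(cats) - 1))
    Y = (X - pv) / b
    total = 0.0
    for t in np.unique(labels):
        rows = Y[labels == t]
        total += len(rows) * (rows.mean(axis=0) ** 2).sum()
    return total

# Hypothetical nominal variable with three categories and a 3-cluster partition.
cat = np.array(list("aabbbcacab"))
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
A, W, M = nominal_contributions(cat, labels)
print(A, W, M)
```

In this form M(S/k) is the Pearson chi-square statistic of the category-by-cluster table divided by #k - 1.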

The contribution of a quantitative variable to the explained part of the
$L_2$ data scatter is also meaningful. When the variable k is standardized, it
is exactly $N\eta^2(k, S)$ where $\eta^2(k, S)$ is the so-called correlation
ratio (squared).


2.3 Least-moduli decomposition


A similar decomposition can be done for the least-moduli criterion (the
contents of this section are a corrected version of section 6.1.4 in Mirkin,
1996).

Let $S_t$ be an entity subset, and
$S_{tv} = \{i \in S_t : |y_{iv}| < |c_{tv}| \ \&\ \mathrm{sgn}\, y_{iv} = \mathrm{sgn}\, c_{tv}\}$
where, as usual, sgn x is 1 if x > 0, 0 if x = 0, and −1 if x < 0. This means
that $S_{tv} = \{i \in S_t : 0 < y_{iv} < c_{tv}\}$ if $c_{tv}$ is positive or
$S_{tv} = \{i \in S_t : c_{tv} < y_{iv} < 0\}$ if $c_{tv}$ is negative. Having a
value $c_{tv}$ fixed, the set $S_t$ is partitioned into three subsets by the
variable/category v depending on relations between $y_{iv}$, $i \in S_t$, and
$c_{tv}$. For $c_{tv} > 0$, let us denote the cardinalities of the subsets where
$y_{iv}$ is larger than, equal to or less than $c_{tv}$ by $n_{tv1}$, $n_{tv2}$
and $n_{tv3}$, respectively. Then, let $n_{tv} = n_{tv1} + n_{tv2} - n_{tv3}$.
For $c_{tv} < 0$, the symbols $n_{tv1}$ and $n_{tv3}$ are interchanged. If
$c_{tv}$ is the median of values $y_{iv}$ in $S_t$ and all the values $y_{iv}$,
$i \in S_t$, are different, then $n_{tv1} = n_{tv3}$ and $n_{tv} = n_{tv2} = 0$
or $= 1$ depending on the cardinality of $S_t$ (even or odd, respectively).


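Statement 3 When values $c_{tv}$ are $L_1$-optimal for a partition
$S = \{S_t\}$ of I, the following decomposition of the module data scatter holds:


$$\sum_{i \in I}\sum_{v \in V} |y_{iv}| = \sum_{v \in V}\sum_{t=1}^{m} \Big(2\sum_{i \in S_{tv}} |y_{iv}| + n_{tv}|c_{tv}|\Big) + \sum_{i \in I}\sum_{v \in V} |e_{iv}| \qquad (10)$$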


The proof is based on the following equation, |a —b| = |a| + |b| — |sgn a+ 
sgn b| min(|a|, |b|), which holds for any real a and b. 
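
A numerical check of decomposition (10) can be sketched as follows, with hypothetical random data and a hypothetical two-cluster partition. The count $n_{tv}$ is computed as $2(n_{tv1} + n_{tv2}) - |S_t|$, which equals $n_{tv1} + n_{tv2} - n_{tv3}$.

```python
import numpy as np

# Hypothetical data: 11 entities, 3 variables, two clusters of sizes 4 and 7.
rng = np.random.default_rng(2)
Y = rng.normal(size=(11, 3))
labels = np.array([0] * 4 + [1] * 7)

total = np.abs(Y).sum()          # module data scatter
explained = 0.0
unexplained = 0.0
for t in range(2):
    Yt = Y[labels == t]
    c = np.median(Yt, axis=0)    # least-moduli standard point c_t
    unexplained += np.abs(Yt - c).sum()
    for v in range(Y.shape[1]):
        y, cv = Yt[:, v], c[v]
        same_sign = np.sign(y) == np.sign(cv)
        in_Stv = same_sign & (np.abs(y) < np.abs(cv))            # the set S_tv
        n12 = (same_sign & (np.abs(y) >= np.abs(cv))).sum()      # n_tv1 + n_tv2
        n_tv = 2 * n12 - len(y)                                  # n_tv1 + n_tv2 - n_tv3
        explained += 2 * np.abs(y[in_Stv]).sum() + n_tv * abs(cv)

print(total, explained + unexplained)
```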

Let us denote the contribution of a variable-cluster pair (v,t) to the
module scatter in (10) by
$s(t,v) = 2\sum_{i \in S_{tv}} |y_{iv}| + n_{tv}|c_{tv}|$. Based on this,
various relative contribution measures can be defined: (a) variable
to scatter, $w(v) = \sum_t s(t,v)/\sum_{i,v} |y_{iv}|$; (b) cluster to scatter,
$w(t) = \sum_v s(t,v)/\sum_{i,v} |y_{iv}|$; (c) variable to cluster,
$w(v/t) = s(t,v)/\sum_v s(t,v)$; (d) entity to cluster,
$w(i/t) = \sum_v \big(|\mathrm{sgn}\, y_{iv} + \mathrm{sgn}\, c_{tv}| \min(|y_{iv}|, |c_{tv}|) - |c_{tv}|\big)$.

Let us consider the case when v is a category. 


Statement 4 For any category v standardized (with arbitrary $a_v$ and $b_v$),
its median, $c_{tv}$, in cluster $S_t$ is equal to $-a_v/b_v$,
$(1 - 2a_v)/2b_v$, or $(1 - a_v)/b_v$ depending on whether $p_{vt}$ is smaller
than, equal to, or greater than $0.5p_t$, respectively. The contribution of a
category-cluster pair, $(v, S_t)$, to the module data scatter is equal to


$$s(t,v) = N|2p_{vt} - p_t||c_{tv}|.$$


Proof: The formulas for $c_{tv}$ are evident. To derive the formula for s(t,v),
let us see that $S_{tv} = \emptyset$ since the values $c_{tv}$ and $y_{iv}$ must
have different signs if they are not equal to each other ($y_{iv}$ may have one
of two values only since v is a category). Then, $n_{tv1} + n_{tv2} = Np_{vt}$
and $n_{tv3} = N(p_t - p_{vt})$ when $c_{tv} = (1 - a_v)/b_v > 0$ where $a_v$,
$b_v$ are the values used in the module-scatter standardization rule.
Analogously, $n_{tv1} + n_{tv2} = N(p_t - p_{vt})$ and $n_{tv3} = Np_{vt}$ when
$c_{tv} = -a_v/b_v < 0$. □


Putting the module-scatter based $a_v$ and $b_v$ into s(t,v), we have:


Statement 5 The contribution of a nominal variable k to the absolute
scatter of the module-scatter standardized data, as explained by the partition
$S = \{S_1, ..., S_m\}$, is equal to


$$A(S/k) = N\Big[\sum_{(v,t) \in A_+} \frac{2p_{vt} - p_t}{p_v} + \sum_{(v,t) \in A_-} \frac{p_t - 2p_{vt}}{1 - p_v} + \sum_{(v,t) \in A_=} |2p_{vt} - p_t|\Big] \qquad (11)$$


where $A_+ = \{(v,t) : p_v < 0.5 \text{ and } p_{vt}/p_t > 0.5\}$,
$A_- = \{(v,t) : p_v > 0.5 \text{ and } p_{vt}/p_t < 0.5\}$, and
$A_= = \{(v,t) : p_v = 0.5\}$.


The coefficient A(S/k) takes into account the situations when the patterns
of occurrences of the categories $v \in k$ in the clusters t differ from
those in the entire set I. Such a difference appears when v is frequent in
$S_t$ ($p(v/t) = p_{vt}/p_t > 0.5$) and rare in I ($p_v < 0.5$), or, conversely,
v is rare in $S_t$ and frequent in I.


3 K-Means and bilinear partitioning 


3.1 Principal clustering and K-Means 


Following the standard strategy of sequential extraction of factors in prin- 
cipal component analysis, the clusters can be extracted one by one in (2), 
which constitutes the method of principal cluster analysis (PCL) (applica- 
ble in both overlapping and non-overlapping cluster cases). 


1. Set t = 1 and define data matrix $Y_t$ as the initial data matrix
standardized according to the criterion, $L_1$ or $L_2$, chosen as described in
Section 2.1. Choose whether the clusters to be found are required to be
nonoverlapping or may overlap each other.

2. For $Y = Y_t$ find a principal cluster minimizing $L_1$ or $L_2$ as described
in the algorithm for Single Cluster Clustering (SCC) below. Define $z_t$, $c_t$
as the cluster solution found (membership function and the standard point,
respectively); compute its contribution to the data scatter.

3. Stop-Condition. If there must be nonoverlapping hard clusters, check 
whether there are yet unclustered entities remaining. (In the other case, 


Lı and Lz approximation clustering for mixed data 481 


check the standard contribution-based stopping rule of the principal com- 
ponent analysis). If yes, go to 4; else end. 

4. Compute the residual data $y_{iv}^{t+1} = y_{iv}^{t} - c_{tv}z_{it}$. In the
nonoverlapping case, set $Y_{t+1}$ by removing from $Y_t$ all the rows
corresponding to the previously found cluster t. Increase t by 1, and go to 2.


The PCL algorithm can be rephrased in terms of the K-Means method (MacQueen
(1967), Jain and Dubes (1988)), which, in its “parallel” version,
starts with m somehow selected tentative standard points or “seeds”, $c_t$.
Then the algorithm repeatedly performs the following two-step iteration:
(1) update the partition based on the standard points: given $c_t$, make each
$S_t$ the set of $y_i$ that are nearest to $c_t$, t = 1,...,m; (2) update the
standard points: when all $S_t$ are given, compute $c_t$ as the mean (or median,
for $L_1$) of the within-cluster vectors. This algorithm is, in fact, a version
of the alternating minimization for criteria $L_p$: (1) given c, find optimal z;
(2) given z, find optimal c.

The principal cluster analysis can be considered as a technique that 
exploits many of the same mechanisms, but which mitigates the need for 
prior knowledge, and separates clusters from the set of instances one by one. 
First, an initial cluster $S_1 \subset I$ is extracted with its standard point
$c_1$; the complementary set represents the main “body” of entities, which
serves as the source for separating additional clusters one by one. This is
reflected in the fact that the main body’s standard point is fixed at 0, given
the square or module scatter standardization, and it is not changed during the
entire clustering computation. The algorithm SCC (Single Cluster Clustering) for
separating a principal cluster at Step 2 of PCL is as follows (the data
matrix is denoted by Y, not $Y_t$):


Step 1. (Selection of an extreme point). Pick a point, $y_{i^*}$, maximizing
the distance $d(0, y_i)$, $i \in I$, from the origin (the distance is taken
according to the criterion, $L_1$ or $L_2$, selected: city-block or Euclidean
squared). Take $c = y_{i^*}$ as the initial center (seed) of the cluster to be
found.

Step 2. (Updating of the cluster). Define cluster S of points $y_i$ around
the center c as $S = \{i : d(y_i, c) < d(y_i, 0)\}$.

Step 3. (Updating of the standard point). Compute the center of S,
c = c(S), which is the median or average vector, depending on whether $L_1$ or
$L_2$ is utilized.

Step 4. (Stop condition). Compare S with that at the previous iteration.
If there is no difference, the process ends: S and c(S) are the result. Else
go to Step 2.
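
Steps 1 to 4 can be sketched in a few lines. This is an illustrative reading, not the author's code: the function name `scc`, the iteration cap, the guard against an empty cluster and the toy data are assumptions, and the data are taken as already standardized so that the origin is the reference point.

```python
import numpy as np

def scc(Y, p=2, max_iter=100):
    """Single Cluster Clustering sketch; returns (boolean membership, center)."""
    Y = np.asarray(Y, dtype=float)
    if p == 2:   # Euclidean squared distance, average as the center
        dist = lambda A, c: ((A - c) ** 2).sum(axis=1)
        center = lambda A: A.mean(axis=0)
    else:        # city-block distance, median as the center
        dist = lambda A, c: np.abs(A - c).sum(axis=1)
        center = lambda A: np.median(A, axis=0)
    d0 = dist(Y, 0.0)                  # distances to the origin
    c = Y[np.argmax(d0)].copy()        # Step 1: the farthest point is the seed
    S_prev = None
    for _ in range(max_iter):
        S = dist(Y, c) < d0            # Step 2: points nearer to c than to 0
        if not S.any():                # assumed guard, not in the original steps
            break
        if S_prev is not None and np.array_equal(S, S_prev):
            break                      # Step 4: no change, stop
        c = center(Y[S])               # Step 3: update the standard point
        S_prev = S
    return S, c

# Hypothetical data: four points near the origin, three far from it.
Y = np.array([[0.1, 0.0], [-0.1, 0.1], [0.0, -0.2], [0.2, 0.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
S2, c2 = scc(Y, p=2)
S1, c1 = scc(Y, p=1)
print(S2, c2)
```

On this toy set both criteria separate the three distant points and leave the remaining four in the main body around the origin.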


A general property is that the size of an SCC-designed cluster depends
on its distance from the origin (which is just a reference point): the nearer
to that point, the smaller the diameter of the cluster. Thus, SCC could be
modified to allow the user to specify the reference-point origin based on
the user’s knowledge of the variable space: the better the knowledge, the
smaller the classes.

The principal cluster analysis method can be used as an option in
extending the K-Means method to a wider class of situations, in which the user
can fix only a few tentative centers, or none, even if she/he does not know the
total number of the clusters or the total number is larger than the number
of tentative centers the user is able to specify (Mirkin (1996)).


3.2 How K-Means parameters should be chosen 


The user of the K-Means method usually faces problems in choosing the
following five important kinds of parameter associated with the method:
1) preliminary transformation of the raw data X into matrix Y to be
processed; 2) entity-to-center distance d(x, c); 3) centroid concept; 4) number
of clusters; 5) initial centers. Traditionally, these parameters are considered
as completely independent except for the obvious equality of the numbers
of clusters and centers. The bilinear clustering model suggests that there
is no independence anymore: the parameters are associated with the criterion
for model fitting. The correspondence is presented in Table 1; the
Chebyshev least-maximum criterion,
$L_\infty = \max_{i,v} |y_{iv} - \sum_t c_{tv}z_{it}|$, also is
included since all these parameters can be derived from it, too.


Criterion             | Data Scatter   | Metric     | Centroid | Scale Parameter    | Shift Parameter
Least Squares         | Square scatter | Euclidean  | Average  | Standard deviation | Average
Least Absolute Values | Module scatter | City-Block | Median   | Absolute deviation | Median
Least Maximum         | Range          | Chebyshev  | Midrange | Half-range         | Midrange


Table 1: Correspondence between clustering parameters due to the bilinear
model.

Table 1 can be used for determining all of the six parameters when the
user is able to choose at least one of them. If, for instance, the user prefers
medians as the centroids, she/he is restricted, due to the bilinear model,
to the least-moduli criterion along with the city-block distance, etc.


4 Bilinear hierarchic clustering


4.1 Linear embedding binary hierarchies 


To discuss hierarchic clustering, we consider a binary hierarchy as a set
of subsets $S_W = \{S_w : S_w \subseteq I, w \in W\}$ called clusters, containing
all singletons and I, so that the clusters $S_w$, $w \in W$, are nested and every
non-singleton cluster $S_w$, $w \in W$, is a union of its two children clusters
$S_{w1}, S_{w2} \in S_W$.

For any nonsingleton cluster $S_w = S_{w1} \cup S_{w2}$ ($w, w1, w2 \in W$) of
$S_W$, its three-valued nest indicator function $\phi_w = (\phi_{iw})$ is defined
by $\phi_{iw} = a_w$ if $i \in S_{w1}$, $= -b_w$ if $i \in S_{w2}$, and $= 0$ if
$i \notin S_w$, where the values $a_w$ and $b_w$ satisfy the two conditions:
(1) vector $\phi_w$ is centered; (2) the norm of vector $\phi_w$ is 1. It is
easy to see that


$$a_w = \sqrt{\frac{n_{w2}}{n_{w1} n_w}}, \quad \text{and} \quad b_w = \sqrt{\frac{n_{w1}}{n_{w2} n_w}} \qquad (12)$$


where $n_w$, $n_{w1}$, and $n_{w2}$ are cardinalities of $S_w$ and its two
children, $S_{w1}$ and $S_{w2}$, respectively.

It turns out that vectors $\phi_w$ are mutually orthogonal,
$(\phi_w, \phi_{w'}) = 0$, which is trivial when $S_w \cap S_{w'} = \emptyset$
and also true when $S_w \cap S_{w'} \neq \emptyset$ since in the latter
case one of the clusters is a part of the other and, thus, its vector is
non-zero only where the other vector’s components are constant. Therefore, the
set $\{\phi_w : w \in W\}$ is an ortho-normal basis of the (N − 1)-dimensional
space of all N-dimensional centered vectors, and any column-centered data
matrix Y can be decomposed as follows:


$$Y = \Phi C \qquad (13)$$


where $\Phi = (\phi_{iw})$ is the N × (N − 1) matrix of the values of the nest
indicator functions and $C = (c_{wv})$ is an (N − 1) × |V| matrix.

Since $\Phi^T\Phi$ is the identity matrix, multiplying equality in (13) by
$\Phi^T$ leads to $C = \Phi^T Y$, that is,


$$c_{wv} = \sqrt{\frac{n_{w1} n_{w2}}{n_w}}\,(\bar{y}_{w1v} - \bar{y}_{w2v}) = \sqrt{\frac{n_w n_{w1}}{n_{w2}}}\,(\bar{y}_{w1v} - \bar{y}_{wv}) \qquad (14)$$


where $\bar{y}_{wv}$, $\bar{y}_{w1v}$ and $\bar{y}_{w2v}$ are the averages of
the variable/category $v \in V$ in $S_w$, $S_{w1}$ and $S_{w2}$, respectively.
By analogy with the factor loadings in the principal component analysis, the
entries of C can be referred to as the cluster loadings.
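
The construction can be illustrated on a small hierarchy. The sketch below assumes a hypothetical binary hierarchy on five entities and random column-centered data; `nest_indicator` is a hypothetical helper implementing (12).

```python
import numpy as np

def nest_indicator(N, S1, S2):
    """Three-valued nest indicator of the split S_w = S1 ∪ S2, as in (12)."""
    n1, n2 = len(S1), len(S2)
    n = n1 + n2
    phi = np.zeros(N)
    phi[list(S1)] = np.sqrt(n2 / (n1 * n))
    phi[list(S2)] = -np.sqrt(n1 / (n2 * n))
    return phi

# Hypothetical hierarchy on I = {0,...,4}: each pair lists the two children
# of one non-singleton cluster (N - 1 = 4 splits in total).
N = 5
splits = [([0, 1, 2], [3, 4]), ([0], [1, 2]), ([1], [2]), ([3], [4])]
Phi = np.column_stack([nest_indicator(N, a, b) for a, b in splits])

rng = np.random.default_rng(3)
Y = rng.normal(size=(N, 2))
Y -= Y.mean(axis=0)              # column-centered data
C = Phi.T @ Y                    # cluster loadings

# First form of (14) for the top split {0,1,2} | {3,4}
top = np.sqrt(3 * 2 / 5) * (Y[:3].mean(axis=0) - Y[3:].mean(axis=0))
print(C[0], top)
```

The same objects also satisfy $Y^TY = C^TC$, since the columns of $\Phi$ are orthonormal.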

Let us denote by $\bar{y}_w$ the |V|-dimensional vector of the averages of the
variables in a subset $S_w$, $w \in W$. The equality in (14) implies that both
$L_1$ and $L_2$ norms of vector $c_w = (c_{wv})$ can be expressed as


$$\mu_w = \sqrt{\frac{n_{w1} n_{w2}}{n_w}}\, d(\bar{y}_{w1}, \bar{y}_{w2}) \qquad (15)$$


where d(x,y) is the city-block/Euclidean distance between vectors x, y. The
value $\mu_w$ is positive if $x \neq y$, and zero if x = y. It is an analogue of
the singular value in decomposition (13) considered as an analogue of the
singular-value decomposition.

Another useful property of decomposition (13) is that $Y^TY = C^TC$,
which is a decomposition of the between-variable covariance (or correlation)
coefficients by clusters of the hierarchy $S_W$.

Thus, in the case when the entire cluster hierarchy is available, there is not
much difference between the $L_1$ and $L_2$ cases: just the choice of distance
must be made accordingly, while the average serves as the center in both cases.


4.2 Least-squares hierarchy fitting 


When the hierarchy is partly unknown so that only upper clusters are given,
the exact equality in (13) must be changed for the bilinear model equation,
in this case $Y = \Phi C + E$, where the residuals E are to be minimized. It
is not difficult to prove that, when some columns of $\Phi$ are given (as a part
of a basis), the least-squares estimator for the corresponding C still satisfies
equation (14). Moreover, the data scatter is decomposed as follows:


$$\sum_{i \in I, v \in V} y_{iv}^2 = \sum_{t=1}^{m} \mu_t^2 + \sum_{i \in I, v \in V} e_{iv}^2, \qquad (16)$$

so that finding an optimal m-column $\Phi$ requires maximizing
$\sum_{t=1}^{m} \mu_t^2$.

In the framework of the principal component analysis-like sequential
fitting strategy, splits are to be made sequentially, starting with the entire
set I, each time maximizing the corresponding $\mu_w^2$. It is exactly the
criterion


$$\mu_w^2 = \frac{n_{w1} n_{w2}}{n_w}\, d^2(\bar{y}_{w1}, \bar{y}_{w2}), \qquad (17)$$


suggested by Edwards and Cavalli-Sforza (1965) for divisive clustering, to
be maximized by splitting a cluster $S_w$ into $S_{w1}$ and $S_{w2}$. The step of
taking residual data in the principal component analysis-like strategy can
be skipped here since it doesn’t affect the results, as is not difficult to
prove. The standard K-Means method (with two clusters) can be applied as an
alternating maximization technique since criterion (17) is equivalent to the
least-squares clustering criterion.

The other expression in (14) leads to the same criterion expressed as
$\mu_w^2 = n_w n_{w1} d^2(\bar{y}_{w1}, \bar{y}_w)/n_{w2}$, which implies a
different, SCC-like algorithm, because the center $\bar{y}_w$ of $S_w$ does not
vary in the splitting process.
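
The equivalence between criterion (17) and the least-squares two-cluster criterion can be checked by brute force on a small set. In the sketch below the data and the helper names `mu2` and `sse` are hypothetical; every split of seven entities into two non-empty parts is enumerated.

```python
import numpy as np
from itertools import combinations

def mu2(Y, idx1, idx2):
    """Criterion (17): n1 n2 / n times the squared Euclidean distance of the means."""
    n1, n2 = len(idx1), len(idx2)
    diff = Y[idx1].mean(axis=0) - Y[idx2].mean(axis=0)
    return n1 * n2 / (n1 + n2) * (diff ** 2).sum()

def sse(A):
    """Within-set sum of squared deviations from the set mean."""
    return ((A - A.mean(axis=0)) ** 2).sum()

rng = np.random.default_rng(4)
Y = rng.normal(size=(7, 2))
all_idx = list(range(len(Y)))

# Each unordered split appears once: the smaller part has 1, 2 or 3 entities.
splits = []
for r in range(1, len(Y) // 2 + 1):
    for s1 in combinations(all_idx, r):
        s1 = list(s1)
        s2 = [i for i in all_idx if i not in s1]
        splits.append((s1, s2))

best_by_mu = max(splits, key=lambda s: mu2(Y, s[0], s[1]))
best_by_sse = min(splits, key=lambda s: sse(Y[s[0]]) + sse(Y[s[1]]))
print(best_by_mu, best_by_sse)
```

For every split, $\mu_w^2$ equals the drop in the within-cluster sum of squares, so maximizing (17) and minimizing the two-cluster least-squares criterion select the same split.
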


4.3 Least-moduli hierarchic clustering


The least-moduli estimator of C in the bilinear equation $Y = \Phi C + E$
(with E minimized) does not satisfy (14) anymore. Let us discuss the
sequential fitting strategy with the criterion,
$L_1 = \sum_{i \in I}\sum_{v \in V} |y_{iv} - c_v\phi_i|$,
to be minimized by unknown c, $\phi$ (index w is omitted, for notational
simplicity). Let us denote $c = (c_v)$ and apply the definition of $\phi$; the
criterion becomes:

$$L_1(S_1, S_2, c) = \sum_{i \in S_1} d(y_i, c') + \sum_{i \in S_2} d(y_i, c'') \qquad (18)$$

where d is the city-block distance and $c' = (n_2/nn_1)^{1/2}c$,
$c'' = -(n_1/nn_2)^{1/2}c$. The latter equations give $n_1c' = -n_2c''$, which
makes the alternating minimization algorithm for criterion (18) different from
the standard K-Means. More precisely, one step, updating of the clusters, is
performed exactly as in K-Means: just collecting the entities around centers,
c' and c'', to give $S_1$ and $S_2$, respectively. The other, updating of the
center, is somewhat more subtle and based on an equivalent form of (18):


$$n_1 n_2 L_1 = \sum_{i \in S_1} n_2\, d(n_1 y_i, \tilde{c}) + \sum_{i \in S_2} n_1\, d(-n_2 y_i, \tilde{c}) \qquad (19)$$


where $\tilde{c} = (n_1n_2/n)^{1/2}c$. The center updated has $c_v$ equal to
$(n/n_1n_2)^{1/2}\tilde{c}_v$, where $\tilde{c}_v$ is the weighted median in the
set of reals consisting of $n_1y_{iv}$ (for $i \in S_1$, weight $n_2$) and
$-n_2y_{iv}$ (for $i \in S_2$, weight $n_1$), for every $v \in V$.
The weighted median for a set of decreasing $x_1, ..., x_N$ having
$p_1, ..., p_N$ as their respective weights is defined as the value $x_a$
(a = 1,..., N) for which $\sum_{i<a} p_i = \sum_{i>a} p_i$ or, when this
equality cannot be achieved, it is an intermediate between those $x_a$ and
$x_{a+1}$ for which the best approximation of the weight equality is reached.
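
The center-updating step can be sketched as follows. The sketch is illustrative and rests on assumptions: `weighted_median` returns one of the sorted values (a lower weighted median) rather than an intermediate point, which still minimizes the weighted sum of absolute deviations; the function names and the toy split are hypothetical.

```python
import numpy as np

def weighted_median(x, w):
    """A minimizer of sum_i w_i |x_i - c| (lower weighted median)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)
    x, w = x[order], w[order]
    cum = np.cumsum(w)
    return x[np.searchsorted(cum, 0.5 * cum[-1])]

def update_center(Y, S1, S2):
    """Center c for a fixed split: weighted medians of n1*y (S1) and -n2*y (S2)."""
    n1, n2 = len(S1), len(S2)
    n = n1 + n2
    vals = np.vstack([n1 * Y[S1], -n2 * Y[S2]])
    wts = np.concatenate([np.full(n1, n2), np.full(n2, n1)])
    c_tilde = np.array([weighted_median(vals[:, v], wts) for v in range(Y.shape[1])])
    return np.sqrt(n / (n1 * n2)) * c_tilde

def l1_split_criterion(Y, S1, S2, c):
    """Criterion (18) with c' and c'' derived from c."""
    n1, n2 = len(S1), len(S2)
    n = n1 + n2
    c1 = np.sqrt(n2 / (n * n1)) * c
    c2 = -np.sqrt(n1 / (n * n2)) * c
    return np.abs(Y[S1] - c1).sum() + np.abs(Y[S2] - c2).sum()

# Hypothetical data and split.
rng = np.random.default_rng(5)
Y = rng.normal(size=(9, 2))
S1, S2 = [0, 1, 2, 3], [4, 5, 6, 7, 8]
c_opt = update_center(Y, S1, S2)
print(c_opt, l1_split_criterion(Y, S1, S2, c_opt))
```
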
Regretfully, the $L_1$ criterion is irrelevant with regard to nominal data.


Statement 6 The value of the $L_1$ criterion (18) does not depend on the
module-scatter standardized nominal variables.


Proof: Indeed, in a typical case when $p_v \neq 0.5$, the weighted median of
the column $y_v$ is zero, which makes the corresponding components in c, c',
c'' in (18) zero, thus giving the same contribution in either term of
the criterion. □


EPILOGUE


Alas, that Spring should vanish with the Rose 

That Youth’s sweet-scented Manuscript should close 
The Nightingale that in the Branches sang, 

Ah, whence, and whither flown again, who knows. 


Khayyam Naishapuri: Persian Astronomer and Poet (1048-1131) 
He looked into the water and saw that it was made up of a thousand
thousand thousand and one different currents, each one a different colour,
weaving in and out of one another like a liquid tapestry of breathtaking
complexity; and Iff explained that these were the Streams of Story, that each
coloured strand represented and contained a single tale.


(Salman Rushdie, Haroun and the Sea of Stories, 1990) 


1 The first ten years 


A good case can be made that modern robustness begins in 1960, with 
the papers by J.W. Tukey on sampling from contaminated distributions, 
and by F.J. Anscombe on the rejection of outliers. Tukey’s paper drew 
attention to the dramatic effects of seemingly negligible deviations from the 
model, and it made effective use of asymptotics in combination with the 
gross error model. Anscombe introduced a seminal insurance idea: sacrifice 
some performance at the model in order to insure against ill effects caused 
by deviations from it. Most of the basic ideas, concepts and methods of
robustness were invented in quick succession during the following years and
were in place by about 1970.

¹First published in Student (1995), Vol. 1, No. 2, 75-86. Reproduced by
permission of the Presses Académiques Neuchâtel, Switzerland.

In 1964 I recast Tukey’s general setup into an asymptotic minimax 
framework and was able to solve it. Important points were: the insistence 
on finite, but small deviations, the formal recognition that a large error 
occurring with low probability is a small deviation, and the switch from 
the then prevalent criterion of relative efficiency to absolute performance. 
At the same time, I introduced the notion of maximum likelihood type or 
M-estimators. 

Hampel (1968) added the formal definition of qualitative robustness 
(continuity under a suitable weak topology), infinitesimal robustness in 
form of the influence function (von Mises derivative of a statistical func- 
tional) and the notion of breakdown point. 

A nagging early worry had been the possible irrelevancy of asymptotic 
approaches: conceptually, 1% gross errors in samples of 1000 are entirely 
different from the same error rate in samples of 5. This worry was laid 
to rest by Huber (1965, 1968): both for tests and for estimates, there are 
qualitatively identical and even quantitatively similar exact finite sample 
minimax robustness results. 

The end of the decade saw the first steps of an extension of asymp- 
totic robustness theory beyond location to more general parametric mod- 
els, namely the introduction of shrinking neighborhoods by C. Huber-Carol 
(1970), as well as the first attempt at studentizing (Huber, 1970). 

The basic methodology for Monte Carlo studies of robust estimators was 
established in Princeton 1970/71, see Andrews et al. (1972). That study 
more or less finished off the problem of robustly estimating a single location 
parameter in samples of size 10 or larger, opening the way for more general 
multi-parameter problems. 

In this paper I shall pick a few of the more interesting robustness ideas 
and follow the strands of their stories to the present. What has happened to 
the early strands since their invention? What important new strands have 
begun in the 1970’s and 1980’s? I shall avoid technicalities and instead give 
a reference to a recent survey or monograph, where feasible. Completeness 
is not intended; for a recent “far from complete” survey of research direc- 
tions in robust statistics, with more than 500 references, see Stahel’s article 
in Stahel and Weisberg (1991, Part II, p. 243). 


2 Influence functions and pseudo-values 


Among the basic robustness concepts, influence functions have become a 
standard tool, especially after the comprehensive treatment by Hampel et 
al. (1986). Also the trick to robustize classical procedures through the
use of pseudo-values is becoming common knowledge, even though it has
received only scant coverage in the literature. This trick is that one
calculates robust fitted values $\hat{y}_i$ by iteratively applying the
classical procedure to the pseudo-values $y_i^* = \hat{y}_i + r_i^*$ instead of
$y_i$. Here, the pseudo-residual $r_i^* = \psi(r_i)$ is obtained by cutting down
the current residual $r_i = y_i - \hat{y}_i$ with the help of a function $\psi$
proportional to the desired influence function (i.e. with the $\psi$-function
defining an M-estimate). For examples see in particular Bickel (1976, p. 167),
Huber (1979), and Kleiner, Martin and Thomson (1979). If $\psi$ is chosen equal
rather than merely proportional to the influence function, the classical
formulas, when applied to the pseudo-residuals $r_i^*$ instead of the residuals,
yield asymptotically correct error estimates for ANOVA and other purposes
(Huber 1981, p. 197).
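
For the simplest case, where the classical procedure is the sample mean, the pseudo-value iteration can be sketched as below. Several choices are assumptions made for illustration only: the Huber clipping function with tuning constant k = 1.5, a fixed MAD-based scale applied as $r_i^* = s\,\psi(r_i/s)$, the median as starting value, and the function names.

```python
import numpy as np

def huber_psi(r, k):
    """Huber's psi: identity in the middle, clipped at +/- k."""
    return np.clip(r, -k, k)

def location_by_pseudovalues(y, k=1.5, tol=1e-10, max_iter=200):
    """Robust location: repeatedly average the pseudo-values y* = fit + r*."""
    y = np.asarray(y, dtype=float)
    # Assumption: a fixed, normalized MAD scale makes the clipping scale-equivariant.
    scale = np.median(np.abs(y - np.median(y))) / 0.6745
    fit = np.median(y)
    for _ in range(max_iter):
        r = y - fit                                      # current residuals
        pseudo = fit + scale * huber_psi(r / scale, k)   # pseudo-values
        new_fit = pseudo.mean()                          # classical procedure: the mean
        converged = abs(new_fit - fit) < tol
        fit = new_fit
        if converged:
            break
    return fit, scale

# Hypothetical sample with one gross error.
y = np.array([0.1, -0.25, 0.2, 0.05, -0.1, 0.15, -0.2, 50.0])
est, s = location_by_pseudovalues(y)
print(est, y.mean())
```

At convergence the fit solves the M-estimating equation $\sum_i \psi((y_i - T)/s) = 0$, so the gross error influences the result only through its clipped residual.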

There have been some very interesting extensions of influence function
ideas to time series (Künsch, 1984).


3 Breakdown and outlier detection 


For a long time, the breakdown point had been a step-child of the robustness 
literature. The paper by Donoho and Huber (1983) was specifically written 
to give it more visibility. Recently, I have begun to wonder whether it has
given it too much; the suddenly fashionable emphasis on high breakdown
point procedures has become counter-productive. One of the most striking
examples of the usefulness of the concept can be found in Hampel (1985): 
the combined performance of outlier rejection followed by the sample mean 
as an estimate of location is essentially determined by the breakdown of 
the outlier detection procedure. 


4 Studentizing 


Whenever we have an estimate, we ought to provide an indication of its 
statistical accuracy, say by giving a 95% confidence interval. This is not 
particularly difficult if the number of observations is very large, so that the 
estimate is asymptotically normal with an accurately estimable standard 
error, or also in one-parameter problems without nuisance parameter, where 
the finite sample theory of Huber (1968) can be applied. 

Otherwise, we end up with a tricky problem of studentization. To my 
knowledge, there has not been much progress beyond the admittedly un- 
satisfactory initial paper of Huber (1970). There are not only many open 
questions with regard to this crucially important problem, it is even open 
what questions one should ask! A sketch of the principal issues follows. 

In the classical normal case, it follows from sufficiency of $(\bar{x}, s)$ and an


490 Peter J. Huber 


invariance argument that such a confidence interval must take the form


$$(\bar{x} - k_n s/\sqrt{n},\ \bar{x} + k_n s/\sqrt{n})$$


with $k_n$ depending on the sample size, but not on the sample itself. In a
well-behaved robust version, a confidence interval might take the analogous
form


$$(T - K_n S/\sqrt{n},\ T + K_n S/\sqrt{n})$$


where T is an asymptotically normal robust location estimate and S is
the location invariant, scale equivariant, Fisher consistent functional
estimating the asymptotic standard deviation of $\sqrt{n}\,T$, applied to the
empirical distribution. In the case of M-estimates for example, we might use

$$S(F)^2 = S_0^2\, \frac{\int \psi(y)^2\, dF}{\left(\int \psi'(y)\, dF\right)^2}$$

where the argument of $\psi$, $\psi'$ is $y = (x - T)/S_0$, i.e. a robustly
centered and scaled version of the observations, say with $S_0$ = MAD. If we are
interested in 95% confidence intervals, $K_n$ must approach
$\Phi^{-1}(0.975) = 1.96$ for large n. But $K_n$ might depend on the sample
configuration in a non-trivial, translation and scale invariant way: since we
do not have a sufficient statistic, we might want to condition on the
configuration of the sample in an as yet undetermined way.
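
The form of such an interval can be sketched for a Huber-type M-estimate. The sketch is only an illustration under explicit assumptions: the clipping constant k = 1.5, the unnormalized MAD as $S_0$, the simple fixed-point iteration for T, and above all the use of the large-sample limit 1.96 in place of the unknown finite-sample $K_n$; the function name `huber_interval` is hypothetical.

```python
import numpy as np

def huber_interval(x, k=1.5, z=1.96, tol=1e-10, max_iter=200):
    """Interval T -/+ z * S / sqrt(n) with S computed from the empirical distribution."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    med = np.median(x)
    s0 = np.median(np.abs(x - med))      # S_0 = MAD (unnormalized, an assumption)
    T = med
    for _ in range(max_iter):            # solve mean(psi((x - T)/S_0)) = 0
        step = s0 * np.clip((x - T) / s0, -k, k).mean()
        T += step
        if abs(step) < tol:
            break
    y = (x - T) / s0
    psi = np.clip(y, -k, k)              # psi(y)
    dpsi = (np.abs(y) < k).astype(float) # psi'(y) for the Huber psi
    S = s0 * np.sqrt((psi ** 2).mean()) / dpsi.mean()
    half = z * S / np.sqrt(n)            # z stands in for K_n (large-n limit only)
    return T - half, T + half

# Hypothetical sample: 50 standard normal observations plus one gross error.
rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(size=50), [100.0]])
lo, hi = huber_interval(x)
classical_half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
print((lo, hi), (x.mean() - classical_half, x.mean() + classical_half))
```

The sketch says nothing about how $K_n$ should depend on n or on the sample configuration, which is exactly the open question discussed here.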

While the distribution of $\sqrt{n}\,T$ typically approaches the normal, it will do
so much faster in the central region than in the tails, and the extreme tails 
will depend rather uncontrollably on details of the unknown distribution of 
the observations. The distribution of S suffers from similar problems, but 
here it is the low end which matters. The question is: what confidence levels 
make sense and are reasonably stable for what sample sizes? For example, 
given a particular level of contamination and a particular estimate, is n = 10
good enough to derive accurate and robust 99% confidence intervals, or do 
we have to be content with 95% or 90%? I anticipate that such questions 
can (and will) be settled with the help of small sample asymptotics, assisted 
perhaps by configural polysampling (below). 


5 Shrinking neighborhoods 


Direct theoretical attacks on finite neighborhoods work only if the problem 
is location or scale invariant. But for large samples, most point estimation 
problems begin to resemble location problems, so it is possible to derive 
quite general asymptotically robust tests and estimates by letting those 
neighborhoods shrink at a rate $n^{-1/2}$ with increasing sample size. This idea
was first exploited by C. Huber-Carol (1970), followed by Rieder, Beran,
Millar and others. The final word on this approach is contained in Rieder’s 
book (1994). 

The principal motivation clearly is technical: shrinking leads to a man- 
ageable asymptotic theory. But there is also a philosophical justification: 
since the standard goodness-of-fit tests are just able to detect deviations of 
the order $O(n^{-1/2})$, it makes sense to put the border zone between small
and large deviations at $O(n^{-1/2})$. Larger deviations should be taken care
of by diagnostics and modelling, smaller ones are difficult to detect and 
should be covered (in the insurance sense!) by robustness. 

This does not mean that our data samples are supposed to get cleaner if
they get larger. But the shrinkage of the neighborhoods confronts us with a
dilemma, namely a choice between the alternatives:

• improve the model; or

• improve the data; or

• stop sampling.

Note that adaptive estimation is not among the viable alternatives. The 
problem is not one of reducing statistical variability, but one of avoiding 
bias, and the ancient Emperor-of-China paradox applies (you can get a 
fantastically accurate estimate of the height of the emperor by averaging 
the guesses of 600 million Chinese, most of whom never saw the emperor!). 
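The paradox can be made concrete with a small simulation, offered here only as a sketch. All numbers (true height, shared bias, spread of the guesses, sample sizes) are hypothetical; the point is that the standard error of the average shrinks with n while its distance from the truth does not.

```python
import random
import statistics

rng = random.Random(1997)

TRUE_HEIGHT = 170.0  # cm, hypothetical
SHARED_BIAS = 8.0    # cm, hypothetical misconception common to all guessers
GUESS_SD = 15.0      # cm, hypothetical spread of individual guesses

def average_guess(n):
    """Average of n biased guesses and the nominal standard error of that average."""
    guesses = [rng.gauss(TRUE_HEIGHT + SHARED_BIAS, GUESS_SD) for _ in range(n)]
    return statistics.fmean(guesses), GUESS_SD / n ** 0.5

results = {n: average_guess(n) for n in (100, 10_000, 200_000)}
for n, (avg, se) in results.items():
    print(f"n = {n:>7}: average = {avg:7.2f}, "
          f"standard error = {se:5.3f}, error vs truth = {avg - TRUE_HEIGHT:5.2f}")
```

More data makes the average more precise, but it is precise about the wrong value: the error settles near the shared bias instead of near zero.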

The asymptotic theory of shrinking neighborhoods is, in essence, a the- 
ory of infinitesimal robustness and suffers from the same conceptual draw- 
back as approaches based on the influence function: infinitesimal robustness 
(bounded influence) does not automatically imply robustness. The crucial 
point is that in any practical application we have a fixed, finite sample 
size, and we need to know whether we are inside the range of n and ε for
which asymptotic theory yields a decent approximation. This range may 
be difficult to determine, but the breakdown point often is computable and 
may be a useful indicator. 


6 Design 


Robustness casts a shadow on the theory of optimal designs: they lose
their theoretical optimality very quickly under minor violations of linearity
(Huber, 1975) or independence assumptions (Bickel and Herzberg, 1979). I
am not aware of much current activity in this area, but the lesson is clear:
“Naive” designs usually are more robust and better than “optimal” designs.


7 Regression


Back in 1975, the discussants of Bickel (1976) raised interesting criticisms;
in particular, there were complaints about the multiplicity of robust
procedures, and about their computational and conceptual complexity. Bickel
fended them off skillfully and convincingly. 

There may have been reasons for concern then, but the situation has 
become worse. Most of the action in the 1980’s has been on the regression 
front. Here is an incomplete list of robust regression estimators: L1 (going 
back at least to Laplace), M (Huber, 1973), GM (Mallows, 1975), with 
variants by Hampel, Krasker and Welsch, RM (Siegel, 1982), LMS and LTS 
(Rousseeuw, 1984), S (Rousseeuw and Yohai, 1984), MM (Yohai, 1987), τ
(Yohai and Zamar, 1988), SRC (Simpson, Ruppert and Carroll, 1991), and 
no end is in sight. For an up-to-date review see Davies (1993). 

Bickel would not have an easy job now; much of the “Nordic” criticism,
unsubstantiated in 1975, seems to be justified now. In any engineering 
product, an overly rapid sequence of updates sometimes is a sign of vigor- 
ous progress, but it can also be a sign of shoddy workmanship, and often 
it is both. In any case, it confuses the customers and hence is counter- 
productive... 

Robustness has been defined as insensitivity to small deviations from an 
idealized model. What is this model in the regression case? The classical 
model goes back to Gauss and assumes that the carrier X (the matrix X 
of the “independent” variables) is error-free. X may be systematic (as in 
designed experiments), or haphazard (as in most undesigned experiments), 
but its rows only rarely can be modelled as being a random sample from a 
specified multivariate model distribution. Statisticians tend to forget that 
the elements of X often are not observed quantities, but are derived from 
some model (cf. the classical non-linear problems of astronomy and geodesy 
giving rise to the method of least squares in the first place). In essence, 
each individual X corresponds to a somewhat different situation and might 
have to be dealt with differently. Thus, multiplicity of procedures may lie 
in the nature of robust regression. Curiously, most of the action seems to 
have been focused through tunnel vision on just one aspect: safeguard at 
any cost against problems caused by gross errors in a random carrier. 

Over the years, I too had to defend the minimax approach to distribu- 
tional robustness on many occasions. The salient points of my defense were 
that the least favorable situation one is safeguarding against, far from being 
unrealistically pessimistic, is more similar to actually observed error distri- 
butions than the normal model, that the performance loss at a true normal 
model is relatively small, that on the other hand the classically optimal 
procedures may lose sorely if the normal model is just slightly violated,
and that the hardest problems are not with extreme outliers (which are 
easy to detect and eliminate), but with what happens on the shoulders of 
the distributions. Moreover, the computation of robust M-estimates is easy 
and fast (see the last paragraph of this section). Not a single one of these
defense lines can be used with the modern “high breakdown point” regression
estimates. 

A typical cause of breakdown in regression is the presence of gross outliers
in X; while individual such outliers are trivially easy to spot (with the help
of the diagonal of the hat matrix), efficient identification of collaborative
leverage groups is an open, perhaps unsolvable, diagnostic problem. However, I
would advise against treating leverage groups blindly through robustness:
they may hide serious design or modeling problems, and there are similar
problems even with single leverage points.
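For the straight-line model y = a + b x the diagonal of the hat matrix has the closed form $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, which allows a small Python sketch of the contrast between a single leverage point and a leverage group. The carrier values and the function name are hypothetical and chosen only for illustration.

```python
def hat_diagonal(x):
    """Diagonal of the hat matrix for the straight-line model y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

# One outlying carrier value: its leverage is close to the maximum of 1.
single = hat_diagonal([1, 2, 3, 4, 5, 30])

# Two outlying values at the same place share the leverage between them,
# so each one looks far less extreme on its own.
pair = hat_diagonal([1, 2, 3, 4, 5, 30, 30])

print("single outlier:", [round(h, 3) for h in single])
print("pair of outliers:", [round(h, 3) for h in pair])
```

In the first design the outlying point has leverage about 0.99; in the second, each of the two outlying points has leverage just under 0.5. This is a minimal instance of why groups are harder to identify from the diagonal alone than individual points.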


The story behind an outlier among the X (“leverage point”) might for 
example be: 


• a misplaced decimal point,

• an accurate but useless observation, outside of the range of validity of
the model.


If the value at this leverage point disagrees with the evidence extrapo- 
lated from the other observations, this may be because: 


• the outlying observation is affected by a gross error (in X or in y),

• the other observations are affected by small systematic errors (this is
more often the case than one might think),

• the model is inaccurate, so the extrapolation fails.


The existence of several, phenomenologically indistinguishable but con- 
ceptually different situations with different consequences calls for a diag- 
nostic approach (identification of leverage points or groups), followed by 
alternative “what if” analyses. This contrasts sharply with simple location 
estimation, where the observations are exchangeable and a minimax ap- 
proach is quite adequate (although one may want to follow it up with an 
investigation of the causes of grosser errors). 

At the root of the current confusion is that hardly anybody bothers 
about stating all of the issues clearly: not only the estimator and a proce- 
dure for computing it must be specified, but also the situations for which 
it is supposed to be appropriate or inappropriate, and criteria for judg- 
ing estimators and procedures. There has been a tendency to rush into 
print with rash claims and procedures. In particular, what is meant by 
the word breakdown? For many of the newer estimates there are unqualified
claims that their breakdown point approaches 0.5 in large samples.
But such claims tacitly exclude designed situations: if the observations are 
equally partitioned among the corners of a simplex in d-space, no estimate 
whatsoever can achieve a breakdown point above 1/(2d + 2). 

It is one thing to design a theoretical algorithm whose purpose is to 
prove, for example, that a breakdown point 1/2 can be attained in principle, 
and quite another thing to design a practical version that can be used not 
merely on small, but also on medium sized regression problems, with a 
2000-by-50 matrix or so. This last requirement would seem to exclude all 
of the recently proposed robust regression estimates. 

Some comments on the much maligned “plain vanilla” regression M- 
estimates are in order. The M-estimate approach is not a panacea (is there 
such a thing in statistics?), but it is easy to understand, practicable, and 
considerably safer than classical least squares. While I had had regression 
problems in mind from the very beginning (cf. Huber 1964, p.74), I had not 
dared to go into print until 1973, when I believed I understood the asymptotic
behavior of M-estimates of regression. A necessary regularity condition
for consistency and asymptotic normality of the fitted values is that
the maximum diagonal element h of the hat matrix $H = X(X^T X)^{-1} X^T$
tends to 0. While watching out for large values of h does not enforce a 
high breakdown point, it at least may prevent a typical cause of break- 
down. Moreover, with M-estimates of regression, the computing effort for 
large matrices typically is less than twice what is needed for calculating the 
ordinary least squares solution. Both calculations are dominated by the 
computation of a QR or SVD decomposition of the X matrix, which takes 
$O(np^2)$ operations for an (n,p)-matrix. Since the result of that decomposition
can be re-used, the iterative computation of the M-estimate, using
pseudo-values, takes O(np) per iteration with fewer than 10 iterations on 
average. 
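The re-use of the decomposition can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the author's implementation: the QR factorization is done by modified Gram-Schmidt, the scale is a normalized MAD of the least squares residuals and is held fixed during the iteration, Huber's psi with c = 1.345 is used, and the data and function names are hypothetical.

```python
import statistics

def thin_qr(X):
    """Thin QR of an n-by-p matrix (list of rows) by modified Gram-Schmidt: O(n p^2)."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    Q, R = [], [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = sum(q * x for q, x in zip(Q[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, Q[k])]
        R[j][j] = sum(x * x for x in v) ** 0.5
        Q.append([x / R[j][j] for x in v])
    return Q, R  # Q is stored as a list of p orthonormal columns

def solve_ls(Q, R, y):
    """Least squares fit from a stored QR: back-substitute R beta = Q^T y, O(n p)."""
    p = len(R)
    qty = [sum(q * yi for q, yi in zip(Q[k], y)) for k in range(p)]
    beta = [0.0] * p
    for j in reversed(range(p)):
        beta[j] = (qty[j] - sum(R[j][k] * beta[k] for k in range(j + 1, p))) / R[j][j]
    return beta

def huber_psi(u, c=1.345):
    return max(-c, min(c, u))

def m_regression(X, y, c=1.345, tol=1e-8, max_iter=50):
    """Huber M-estimate via pseudo-values; returns (beta_ls, beta_m, iterations)."""
    Q, R = thin_qr(X)                 # decomposition computed once
    beta_ls = solve_ls(Q, R, y)       # ordinary least squares start
    fitted = [sum(b * x for b, x in zip(beta_ls, row)) for row in X]
    # Scale held fixed for simplicity (an assumption of this sketch).
    scale = 1.4826 * statistics.median(abs(yi - fi) for yi, fi in zip(y, fitted))
    beta = beta_ls
    for it in range(1, max_iter + 1):
        fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
        # Pseudo-values: fitted value plus the clipped ("metrically Winsorized") residual.
        pseudo = [fi + scale * huber_psi((yi - fi) / scale, c)
                  for yi, fi in zip(y, fitted)]
        new_beta = solve_ls(Q, R, pseudo)   # re-uses Q and R
        if max(abs(a - b) for a, b in zip(new_beta, beta)) < tol:
            return beta_ls, new_beta, it
        beta = new_beta
    return beta_ls, beta, max_iter

# Hypothetical data: y = 1 + 2x with small alternating errors and one gross error.
xs = list(range(20))
X = [[1.0, float(x)] for x in xs]
y = [1.0 + 2.0 * x + 0.1 * (-1) ** x for x in xs]
y[19] += 50.0
beta_ls, beta_m, n_iter = m_regression(X, y)
print("least squares:", beta_ls)
print("M-estimate:   ", beta_m, "after", n_iter, "iterations")
```

Each pass through the loop costs one pass over the data plus one back-substitution, while the factorization is paid for only once. Holding the scale fixed keeps the sketch short; a fuller version would update the scale as well.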


8 Multivariate problems 


Classically, “regression” problems with errors in the independent variables 
are solved by fitting a least squares hyperplane, that is, by solving a prin- 
cipal component problem and picking the plane belonging to the smallest 
eigenvalue. It can be argued by analogy that regression problems with po- 
tential gross errors in the carrier should be attacked through some version 
of robust principal components. Thus, multivariate problems, in particular 
of the principal component type, deserve added attention. 

In 1976, Maronna showed that all M-estimates of location and scatter in 
d dimensions have a breakdown point $\varepsilon^* \le 1/(d+1)$. In higher dimensions
this is shockingly low, but then Stahel (1981) and Donoho (1982) independently
found estimates based on projection pursuit ideas with a breakdown
point approaching 1/2 in large samples. The bad news is that with all cur- 
rently known algorithms the effort for computing those estimates increases 
exponentially with d. We might say they break down by failing to give a 
timely answer! 


9 Some persistent misunderstandings 


Robust methods have persistently been misclassified and pooled with non- 
parametric and distribution-free methods. They are part of traditional 
parametric statistics; the only difference is that they do not assume that
the parametric model is exactly true. 

In part the robustness community itself is to blame for this misunder- 
standing. In the mid-1970’s adaptive estimates - attempting to achieve 
asymptotic efficiency at all well-behaved error distributions - were thought 
by many to be the ultimate robust estimates. Then Klaassen (1980) proved 
a disturbing result on the lack of stability of adaptive estimates. My cur- 
rent conjecture is that an estimate cannot be simultaneously adaptive in a 
neighborhood of the model and qualitatively robust at the model. 

Also the currently fashionable (over-)emphasis of high breakdown points 
transmits a wrong signal. A high breakdown point is nice to have, if it comes 
for free, but the potential presence of high contamination usually indicates 
a mixture model and calls for diagnostics. A thoughtless application of 
robust procedures might only hide the underlying problem. 

There seems to be some confusion between the respective roles of diag- 
nostics and robustness. The purpose of robustness is to safeguard against 
deviations from the assumptions, in particular against those that are near 
or below the limits of detectability. The purpose of diagnostics is to find 
and identify deviations from the assumptions. 

The term “robust” was coined by Bayesians (Box and Andersen, 1955). 
It is puzzling that Bayesian statistics never managed to assimilate the mod- 
ern robustness concept, but remained stuck with antiquated parametric su- 
permodels. Ad hoc supermodels and priors chosen by intuition (personal 
beliefs) or convenience (conjugate priors) do not guarantee robustness; for
that one needs some theory. Compare already Hampel (1973). 


10 Future directions: small sample problems? 


At present, the most interesting and at the same time most promising new 
methods and open problems have to do with small sample situations. 
The Princeton robustness study (Andrews et al., 1972) remains one of
the best designed comparative Monte Carlo studies in statistics. Among 
other things we learned from it how difficult it is to compare the behavior 
of estimates across entire families of distributions, since small systematic 
differences of performance easily are swamped by larger random sampling 
errors. A very sophisticated sampling method thereafter proposed by Tukey 
is based on the remark that a given sample can occur under any strictly 
positive density, but it will do so with different probabilities. Thus it must 
be possible to compare those performances very efficiently by re-using the 
same sample configurations with different weights when forming Monte 
Carlo averages. A comprehensive account of this approach is given by 
Morgenthaler and Tukey (1991). 
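The underlying remark, that one set of samples can serve several distributions if it is reweighted, can be illustrated with ordinary importance weighting. The sketch below is a much cruder relative of configural polysampling and is not the method of Morgenthaler and Tukey; the sampling density (Laplace), the target densities, the sample size n = 5, the number of replications and all names are assumptions made for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)
N_REPS, N = 20_000, 5

def laplace_draw():
    """One draw from the standard Laplace density 0.5 * exp(-|x|)."""
    return rng.expovariate(1.0) * rng.choice((-1.0, 1.0))

def log_laplace(x):
    return -math.log(2.0) - abs(x)

def log_normal(x):
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x

# The samples are generated once, from a strictly positive density.
samples = [[laplace_draw() for _ in range(N)] for _ in range(N_REPS)]

def reweighted_mse(estimator, log_target):
    """Mean squared error of a location estimator (true value 0) under the
    target density, from the stored samples and self-normalized weights."""
    num = den = 0.0
    for s in samples:
        w = math.exp(sum(log_target(x) - log_laplace(x) for x in s))
        num += w * estimator(s) ** 2
        den += w
    return num / den

mse = {
    (name, target): reweighted_mse(est, logf)
    for name, est in (("mean", statistics.fmean), ("median", statistics.median))
    for target, logf in (("laplace", log_laplace), ("normal", log_normal))
}
for key, value in sorted(mse.items()):
    print(key, round(value, 4))
```

Because both estimators and both distributions are evaluated on the same stored samples, much of the random sampling error is common to the quantities being compared, which is what makes small systematic differences easier to see.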

On the other hand, it was noted by Hampel that a variant of the saddle 
point method can give fantastically accurate asymptotic approximations 
down to very small sample sizes (occasionally down to n = 1!). 

I hope that in the near future these approaches will make it possible to
close several open problems in the area of small samples: studentization,
confidence intervals, testing. Embarrassingly, the robustification of the
statistics of two-way tables still is wide open. Typically there are so few
degrees of freedom per cell that ordinary asymptotic approaches are out of
the question; maybe some version of small sample asymptotics may help here too.


Abstracts


Asymptotic properties of L1 estimators in a multi-stage dose-
response model: A Monte Carlo study


Carmen D.S. André, University of Sao Paulo, Brazil 
Subhash C. Narula, Virginia Commonwealth University, Richmond, USA 


Abstract: The single-stage and two-stage dose response models are fre- 
quently used in practical applications. The maximum likelihood and the 
least squares principles are often used to estimate the unknown parameters 
of the model. It has been shown that these methods are sensitive to outliers 
in the data. The minimum sum of absolute errors (MSAE, or L1) criterion
is more resistant to outliers than these popular procedures. However, at 
present not much is known about the statistical properties of the MSAE 
estimators of the parameters of the multistage dose-response model. In 
this paper, our objective is to study asymptotic properties and distribution 
of the MSAE estimators of the single-stage and two-stage dose-response 
models by simulation and to find the smallest sample size for which we 
may use the asymptotic distribution to draw statistical inferences about 
the parameters. We also give an approximate expression for the variance of 
these estimators when their asymptotic distribution follows a multinormal 
distribution. 


The L1-norm and interlaboratory tests


P. L. Davies, University of Essen, Germany 


Abstract: The form of interlaboratory test we consider is that where
each of I laboratories returns exactly one reading for each of J samples.
Such a test may be described by the random effects model


$$X_{ij} = X_i + a_j + \epsilon_{ij}, \qquad 1 \le i \le I, \quad 1 \le j \le J.$$


The $X_i$ represent the laboratory effects, the $a_j$ the sample contaminations
and the $\epsilon_{ij}$ the measurement errors. The problem is to identify outlying
observations and outlying laboratories. As we have only one observation
per cell it is commonly believed that it is not possible to detect outliers or,
equivalently, non-additivity. As shown in Terbeck and Davies (1996) this
is not correct and so-called unconditionally identifiable outlier patterns
may be found by the L1- or an appropriate M-functional. The results
of Terbeck and Davies are improved in certain respects and then applied
to the random effects model. The method is applied to a real data set
considered by Lischer (1993) which is concerned with the determination of
lead in sewage sludge.


Multivariate L1 mean


Yadolah Dodge and Valentin Rousson, 
University of Neuchatel, Switzerland 


Abstract: The center of a univariate data set $\{x_1, \ldots, x_n\}$ can be defined
as the point $\mu$ that minimizes the norm of the vector of distances
$y = (|x_1 - \mu|, \ldots, |x_n - \mu|)$. As the median and the mean are the minimizers
of respectively the L1- and the L2-norm of y, they are two alternatives
to describe the center of a univariate data set. The center $\boldsymbol{\mu}$ of a
multivariate data set $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ can also be defined as minimizer of the
norm of a vector of distances. In multivariate situations however, there
are several kinds of distances. In this paper, we consider the vector of L1-distances
$y_1 = (\|\mathbf{x}_1 - \boldsymbol{\mu}\|_1, \ldots, \|\mathbf{x}_n - \boldsymbol{\mu}\|_1)$ and the vector of L2-distances
$y_2 = (\|\mathbf{x}_1 - \boldsymbol{\mu}\|_2, \ldots, \|\mathbf{x}_n - \boldsymbol{\mu}\|_2)$. We define the L1-median and the L1-mean
as the minimizers of respectively the L1- and the L2-norm of $y_1$, and then
the L2-median and the L2-mean as the minimizers of respectively the L1-
and the L2-norm of $y_2$. In doing so, we obtain four alternatives to describe
the center of a multivariate data set. While three of them have been already
investigated in the statistical literature, the L1-mean appears to be a new
concept. Contrary to the L1-median, the L1-mean is proved to be unique in
almost all situations. In order to compare these multivariate medians and
means, we use the rule of the net advantage coefficient introduced by Stavig
and Gibbons (1977). A simulation study shows that the L1-mean performs
well, especially for data sets drawn from bivariate Laplace distribution.
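The four definitions share one objective function with two switches, which can be written down directly. The sketch below is an illustration under assumptions: the data points, the function names and the random-candidate check are hypothetical, and only the two cases with a well-known closed form (coordinatewise median for the L1-median, arithmetic mean for the L2-mean) are computed. The L2-median and the L1-mean would need numerical minimization of the same objective, which is not attempted here.

```python
import random
import statistics

def objective(mu, points, dist_p, norm_p):
    """Norm (norm_p = 1 or 2) of the vector of L_{dist_p} distances from the points to mu."""
    dists = []
    for x in points:
        diffs = [abs(a - b) for a, b in zip(x, mu)]
        dists.append(sum(diffs) if dist_p == 1 else sum(d * d for d in diffs) ** 0.5)
    return sum(dists) if norm_p == 1 else sum(d * d for d in dists) ** 0.5

# Hypothetical bivariate data with one remote point.
points = [(0.0, 0.0), (1.0, 3.0), (2.0, 1.0), (4.0, 2.0), (10.0, 9.0)]

# L1-median: L1 distances, L1 norm. The objective separates by coordinate,
# so the coordinatewise median minimizes it.
l1_median = tuple(statistics.median(c) for c in zip(*points))

# L2-mean: L2 distances, L2 norm. This is the ordinary arithmetic mean.
l2_mean = tuple(statistics.fmean(c) for c in zip(*points))

def no_candidate_beats(center, dist_p, norm_p, trials=500, seed=0):
    """Check the center against random candidate points (a sanity check, not a proof)."""
    rng = random.Random(seed)
    best = objective(center, points, dist_p, norm_p)
    return all(
        best <= objective((rng.uniform(-5, 15), rng.uniform(-5, 15)),
                          points, dist_p, norm_p) + 1e-12
        for _ in range(trials)
    )

print("L1-median:", l1_median, no_candidate_beats(l1_median, 1, 1))
print("L2-mean:  ", l2_mean, no_candidate_beats(l2_mean, 2, 2))
# The remaining two centers minimize objective(mu, points, 2, 1) (L2-median)
# and objective(mu, points, 1, 2) (L1-mean).
```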


The information for the direction of dependence in 
L1 regression


Yadolah Dodge, University of Neuchatel, Switzerland 
Joe Whittaker, Lancaster University, U.K. 


Abstract: An L1 regression model for a response variable $X_2$ is to suppose
that the conditional distribution of $X_2$ given $X_1$ is Laplace, and that
the marginal distribution of the explanatory variable $X_1$ is also Laplace.
We show that there is information to distinguish the direction of dependence
between $X_1$ and $X_2$; or equivalently to distinguish between the models in
which $X_1$ is dependent on $X_2$, and $X_2$ is dependent on $X_1$. This is not true
for L2 regression based on the Normal distribution.


Dimension choice for sliced inverse regression based on ranks 


Louis Ferré, Université Paul Sabatier, Toulouse, France 


Abstract: Sliced Inverse Regression is a method for reducing the dimensionality
in multivariate nonparametric regression problems. While
the selection of the dimensionality has been investigated for the original
version, no solution has been proposed for the Hsing and Carroll (1992) approach
based on order statistics and associated concomitant variables. By
using model selection approaches, we propose here two ways for selecting 
the dimensionality by estimating a loss function: first, a direct estimation 
is proposed and, then a Jack-Knifed estimate is investigated. Finally, the 
rank version is compared to classical SIR on a real life data set. 


L1-norm and L2-norm methodology in cluster
analysis 


Allan D. Gordon, University of St Andrews, North Haugh, Scotland 


Abstract: An overview is provided of the use of L1-norm and L2-norm
methodology in cluster analysis. Topics covered include dissimilarity mea- 
sures, partitions, fuzzy classifications, hierarchical classifications, and con- 
sensus classifications. 


Some issues in the applications of conditional 
quantile functions 


Xuming He, University of Illinois, USA 


Abstract: Conditional quantile functions are useful in a variety of appli- 
cations. Regression quantiles for linear models have been recently extended 
to semiparametric and nonparametric models. Further investigations are 
needed for both the statistical theory and computations. In this paper, 
I attempt to raise two questions that I believe are important to build a
solid foundation for the applications of quantile regression. They focus on
nearly extreme quantiles and the problem of crossing in estimated quantile 
functions. 


The median function on structured metric spaces


F. R. McMorris, University of Louisville, USA


Abstract: When (X, d) is a finite metric space and $\pi = (x_1, \ldots, x_k) \in X^k$,
a median for $\pi$ is an element x of X for which $\sum_{i=1}^{k} d(x, x_i)$ is minimum.
The function that returns the set of all medians for any tuple $\pi$ is called
the median function on X. A brief survey is given of some of the results
concerning the median function, starting with an arbitrary metric space and
finishing with the case where X is a set of hypergraphs and d is the metric
based on the L1-norm. A simplistic maximum likelihood interpretation for
the median function is also given.
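The definition translates almost word for word into code. The following sketch is illustrative only: the function name and both example spaces are hypothetical, and the second example uses subsets of a three-element set with the symmetric-difference metric (the L1 distance between indicator vectors) as a simple stand-in for the hypergraph case mentioned in the abstract.

```python
from itertools import combinations

def median_set(space, d, profile):
    """All elements of a finite metric space minimizing the sum of distances to the profile."""
    totals = {x: sum(d(x, xi) for xi in profile) for x in space}
    best = min(totals.values())
    return {x for x, total in totals.items() if total == best}

# Example 1: integers 0..5 with the usual distance. An even-length profile
# can have a whole interval of medians.
line_medians = median_set(range(6), lambda a, b: abs(a - b), (1, 2, 5, 5))
print(sorted(line_medians))

# Example 2: subsets of {a, b, c} with the symmetric-difference metric.
ground = "abc"
subsets = [frozenset(c) for r in range(len(ground) + 1)
           for c in combinations(ground, r)]
profile = (frozenset("a"), frozenset("ab"), frozenset("ac"))
set_medians = median_set(subsets, lambda s, t: len(s ^ t), profile)
print([sorted(m) for m in set_medians])
```

In the second example the unique median keeps exactly the elements that appear in a majority of the profile, here only "a".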


Least absolute value estimation of a linear 
functional model 


Edina Shisue Miazaki, Statistika Consultoria, Campinas, Brazil 
Gabriela Stangenhaus, Universidade de Brasilia, Brazil 


Abstract: This paper presents two robust L1-based estimators for the parameters
of a simple linear functional relationship (SLFR). The maximum
likelihood estimation when the errors follow a double exponential distribution
and the weighted L1 estimation are solved as non-linear optimization
problems. The least median of squares estimates are proposed as starting
values and the scale measures of the errors are based on the MAD. Both
methods are resistant to outlying observations and the weighted L1 estimator
is resistant to leverage points. Real examples illustrate the methods.


ANOVA - models: A Bayesian analysis


Wolfgang Polasek and Shuangzhe Liu, University of Basel, Switzerland


Abstract: The 1-way and 2-way ANOVA are formulated as Bayesian linear
models with conjugate prior distributions. The classical case is treated
as a special one using matrix generalized (g-) inverses leading to so-called
OLS- and OLS* estimates of the rank deficient ANOVA model. The 2- 
way ANOVA model without interactions can also be estimated in a 2-step 
procedure. 


Robustifying growth curve model estimation 


Gabriela Stangenhaus, Statistika Consultoria, Campinas, Brazil 
Elisete C. Quintaneiro Aubin, Universidade de Sao Paulo, Brazil 


Abstract: A robustified version of the parameter matrix estimators in 
the standard growth curve model obtained via the Potthoff-Roy transfor- 
mation is presented. The asymptotic distribution of the robust estimators is 
derived and the estimation of their variance-covariance matrix is discussed. 


Fitting L2 norm classification models to complex data sets


Maurizio Vichi, University “G. D'Annunzio” di Chieti, Pescara, Italy


Abstract: In this paper methodologies for fitting classification models 
(dendrograms and partitions) to two and three-way arrays of dissimilarities 
minimizing an L2 norm loss function are examined. A new algorithm for
fitting several hierarchical classifications to quite large three-way arrays is 
also discussed. 


Applications of mathematical programming in 
L1-estimation of nonlinear models


Jinde Wang, Nanjing University, China 


Abstract: In this paper we review the results in L1-estimation of nonlinear
models obtained by applying mathematical programming techniques.
We describe briefly the ways to find the asymptotic distribution, the ap- 
proximate representation and to treat dependent random error cases and 
inequality-constrained cases. With these results one can conclude that 
mathematical programming is a suitable tool for studying L1-estimation
problems. 


Some contributions to M-estimation in regression models


Lincheng Zhao, 
University of Science and Technology of China, Hefei, China 


Abstract: In this paper, we briefly survey some contributions to asymp- 
totic theory on M-estimation in a linear model and least absolute devia- 
tions (LAD) estimation in a censored regression model (known as the Tobit 
model), as well as on the relevant test criteria in ANOVA in the above mod- 
els. As a general approach on statistical data analysis, asymptotic theory 
of M-estimation in regression models has received extensive attention. In 
recent years, the author and some of his cooperators worked on this field 
and obtained some new results. In this paper we briefly introduce some of 
them and the related work in the literature. As a special case, the minimum 
L1-norm (MLN) estimation, also known as the least absolute deviations
(LAD) estimation, plays an important role and is of special interest. Con- 
sidering this point, we will pay much attention to them as well. Our topics 
concern the usual linear model and a censored regression model, known as 
the Tobit model in econometric research. 